Hamed Habibi Aghdam
Elnaz Jahani Heravi

Guide to Convolutional Neural Networks
A Practical Application to Traffic-Sign Detection and Classification


Hamed Habibi Aghdam
University Rovira i Virgili, Tarragona, Spain

Elnaz Jahani Heravi
University Rovira i Virgili, Tarragona, Spain

ISBN 978-3-319-57549-0
ISBN 978-3-319-57550-6 (eBook)
DOI 10.1007/978-3-319-57550-6

Library of Congress Control Number: 2017938310
© Springer International Publishing AG 2017
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To my wife, Elnaz, who possesses the most
accurate and reliable optimization method
and guides me toward the global optima of life.
Hamed Habibi Aghdam

Preface

The general paradigm for solving a computer vision problem is to represent a raw image
using a more informative vector called a feature vector and to train a classifier on top of
the feature vectors collected from the training set. From the classification perspective, there are
several off-the-shelf methods such as gradient boosting, random forests, and support
vector machines that are able to accurately model nonlinear decision boundaries.
Hence, solving a computer vision problem mainly depends on the feature extraction
algorithm.
Feature extraction methods such as the scale-invariant feature transform, histograms
of oriented gradients, banks of Gabor filters, local binary patterns, bags of features, and
Fisher vectors are some of the methods that performed well compared with their
predecessors. These methods mainly create the feature vector in several steps. For
example, the scale-invariant feature transform and histograms of oriented gradients first
compute the gradient of the image. Then, they pool gradient magnitudes over different
regions and concatenate them in order to create the final feature vector. Similarly,
bags of features and Fisher vectors start by extracting a feature vector, such as a
histogram of oriented gradients, on regions around a bunch of salient points in the image.
Then, these features are pooled again in order to create higher-level feature vectors.
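For readers who want to see the pool-and-concatenate idea in code, the short numpy sketch below computes gradient magnitudes and averages them over a fixed grid of regions; the grid size and the use of a plain average are illustrative simplifications, not the exact recipes of SIFT or HOG.

```python
import numpy as np

def grid_pooled_gradient_feature(image, grid=(4, 4)):
    """Toy descriptor: pool gradient magnitudes over a grid of regions.

    Only an illustration of the pool-and-concatenate idea; real SIFT/HOG
    descriptors additionally bin gradients by orientation.
    """
    gy, gx = np.gradient(image.astype(np.float64))
    magnitude = np.sqrt(gx ** 2 + gy ** 2)

    rows, cols = magnitude.shape
    rs, cs = rows // grid[0], cols // grid[1]
    pooled = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            region = magnitude[i * rs:(i + 1) * rs, j * cs:(j + 1) * cs]
            pooled.append(region.mean())   # pool one region
    return np.asarray(pooled)              # concatenated feature vector

# Example: a random 32x32 grayscale image yields a 16-dimensional feature vector.
feature = grid_pooled_gradient_feature(np.random.rand(32, 32))
print(feature.shape)  # (16,)
```

Real descriptors additionally bin the gradients by orientation, which is what gives the histogram of oriented gradients its name.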
Despite the great efforts of the computer vision community, the above
hand-engineered features were not able to properly model large classes of natural
objects. The advent of convolutional neural networks, large datasets, and parallel
computing hardware changed the course of computer vision. Instead of designing
feature vectors by hand, convolutional neural networks learn a composite feature
transformation function that makes classes of objects linearly separable in the
feature space.
Recently, convolutional neural networks have surpassed humans in different tasks
such as classification of natural objects and classification of traffic signs. After their
great success, convolutional neural networks have become the first choice for
learning features from training data.
One of the fields that has been greatly influenced by convolutional neural
networks is the automotive industry. Tasks such as pedestrian detection, car detection,
traffic sign recognition, traffic light recognition, and road scene understanding are
rarely done using hand-crafted features anymore.


Designing, implementing, and evaluating are crucial steps in developing a successful computer vision method. In order to design a neural network, one
must have basic knowledge of the underlying processes of neural networks and their
training algorithms. Implementing a neural network requires deep knowledge
of the libraries that can be used for this purpose. Moreover, neural networks must be
evaluated quantitatively and qualitatively before they are used in practical
applications.
Instead of going into the details of mathematical concepts, this book tries to adequately explain the fundamentals of neural networks and show how to implement and
assess them in practice. Specifically, Chap. 2 covers basic concepts related to
classification and derives the idea of feature learning using neural networks, starting
from linear classifiers. Then, Chap. 3 shows how to derive convolutional neural
networks from fully connected neural networks. It also reviews classical network
architectures and mentions different techniques for evaluating neural networks.
Next, Chap. 4 thoroughly discusses a practical library for implementing convolutional neural networks. It also explains how to use the Python interface of this
library in order to create and evaluate neural networks. The next two chapters
present practical examples of detection and classification of traffic signs using
convolutional neural networks. Finally, the last chapter introduces a few techniques
for visualizing neural networks using the Python interface.
Graduate and undergraduate students as well as machine vision practitioners can use
the book to gain hands-on knowledge in the field of convolutional neural networks.
Exercises have been designed such that they help readers acquire deeper
knowledge in the field. Last but not least, Python scripts have been provided so that
readers are able to reproduce the results and practice the topics of this book
easily.

Book's Website
Most of the code explained in this book is available at https://github.com/pcnn/. The
code is written in Python 2.7 and requires the numpy and matplotlib libraries.
You can download and try the code on your own.
Tarragona, Spain

Hamed Habibi Aghdam

Contents

1 Traffic Sign Detection and Recognition  1
  1.1 Introduction  1
  1.2 Challenges  2
  1.3 Previous Work  5
    1.3.1 Template Matching  5
    1.3.2 Hand-Crafted Features  5
    1.3.3 Feature Learning  7
    1.3.4 ConvNets  10
  1.4 Summary  12
  References  12

2 Pattern Classification  15
  2.1 Formulation  16
    2.1.1 K-Nearest Neighbor  17
  2.2 Linear Classifier  20
    2.2.1 Training a Linear Classifier  22
    2.2.2 Hinge Loss  30
    2.2.3 Logistic Regression  34
    2.2.4 Comparing Loss Functions  37
  2.3 Multiclass Classification  41
    2.3.1 One Versus One  41
    2.3.2 One Versus Rest  44
    2.3.3 Multiclass Hinge Loss  46
    2.3.4 Multinomial Logistic Function  48
  2.4 Feature Extraction  51
  2.5 Learning Φ(x)  58
  2.6 Artificial Neural Networks  61
    2.6.1 Backpropagation  65
    2.6.2 Activation Functions  71
    2.6.3 Role of Bias  78
    2.6.4 Initialization  79
    2.6.5 How to Apply on Images  79
  2.7 Summary  81
  2.8 Exercises  82
  References  83

3 Convolutional Neural Networks  85
  3.1 Deriving Convolution from a Fully Connected Layer  85
    3.1.1 Role of Convolution  90
    3.1.2 Backpropagation of Convolution Layers  92
    3.1.3 Stride in Convolution  94
  3.2 Pooling  95
    3.2.1 Backpropagation in Pooling Layer  97
  3.3 LeNet  98
  3.4 AlexNet  100
  3.5 Designing a ConvNet  101
    3.5.1 ConvNet Architecture  102
    3.5.2 Software Libraries  103
    3.5.3 Evaluating a ConvNet  105
  3.6 Training a ConvNet  111
    3.6.1 Loss Function  112
    3.6.2 Initialization  113
    3.6.3 Regularization  115
    3.6.4 Learning Rate Annealing  121
  3.7 Analyzing Quantitative Results  124
  3.8 Other Types of Layers  126
    3.8.1 Local Response Normalization  126
    3.8.2 Spatial Pyramid Pooling  127
    3.8.3 Mixed Pooling  127
    3.8.4 Batch Normalization  127
  3.9 Summary  128
  3.10 Exercises  128
  References  129

4 Caffe Library  131
  4.1 Introduction  131
  4.2 Installing Caffe  132
  4.3 Designing Using Text Files  132
    4.3.1 Providing Data  137
    4.3.2 Convolution Layers  139
    4.3.3 Initializing Parameters  141
    4.3.4 Activation Layer  142
    4.3.5 Pooling Layer  144
    4.3.6 Fully Connected Layer  145
    4.3.7 Dropout Layer  146
    4.3.8 Classification and Loss Layers  146
  4.4 Training a Network  152
  4.5 Designing in Python  154
  4.6 Drawing Architecture of Network  157
  4.7 Training Using Python  157
  4.8 Evaluating Using Python  158
  4.9 Save and Restore Networks  161
  4.10 Python Layer in Caffe  162
  4.11 Summary  164
  4.12 Exercises  164
  Reference  166

5 Classification of Traffic Signs  167
  5.1 Introduction  167
  5.2 Related Work  169
    5.2.1 Template Matching  170
    5.2.2 Hand-Crafted Features  170
    5.2.3 Sparse Coding  171
    5.2.4 Discussion  171
    5.2.5 ConvNets  172
  5.3 Preparing Dataset  173
    5.3.1 Splitting Data  174
    5.3.2 Augmenting Dataset  177
    5.3.3 Static Versus On-the-Fly Augmenting  185
    5.3.4 Imbalanced Dataset  185
    5.3.5 Preparing the GTSRB Dataset  187
  5.4 Analyzing Training/Validation Curves  188
  5.5 ConvNets for Classification of Traffic Signs  189
  5.6 Ensemble of ConvNets  199
    5.6.1 Combining Models  200
    5.6.2 Training Different Models  201
    5.6.3 Creating Ensemble  202
  5.7 Evaluating Networks  203
    5.7.1 Misclassified Images  208
    5.7.2 Cross-Dataset Analysis and Transfer Learning  209
    5.7.3 Stability of ConvNet  214
    5.7.4 Analyzing by Visualization  217
  5.8 Analyzing by Visualizing  217
    5.8.1 Visualizing Sensitivity  218
    5.8.2 Visualizing the Minimum Perception  219
    5.8.3 Visualizing Activations  220
  5.9 More Accurate ConvNet  222
    5.9.1 Evaluation  224
    5.9.2 Stability Against Noise  226
    5.9.3 Visualization  229
  5.10 Summary  230
  5.11 Exercises  231
  References  232

6 Detecting Traffic Signs  235
  6.1 Introduction  235
  6.2 ConvNet for Detecting Traffic Signs  236
  6.3 Implementing Sliding Window Within the ConvNet  239
  6.4 Evaluation  243
  6.5 Summary  246
  6.6 Exercises  246
  References  246

7 Visualizing Neural Networks  247
  7.1 Introduction  247
  7.2 Data-Oriented Techniques  248
    7.2.1 Tracking Activation  248
    7.2.2 Covering Mask  248
    7.2.3 Embedding  249
  7.3 Gradient-Based Techniques  249
    7.3.1 Activation Maximization  250
    7.3.2 Activation Saliency  253
  7.4 Inverting Representation  254
  7.5 Summary  257
  7.6 Exercises  257
  References  258

Appendix A: Gradient Descent  259
Glossary  275
Index  279

Acronyms

Adagrad        Adaptive gradient
ADAS           Advanced driver assistance system
ANN            Artificial neural network
BTSC           Belgium traffic sign classification
CNN            Convolutional neural network
ConvNet        Convolutional neural network
CPU            Central processing unit
DAG            Directed acyclic graph
ELU            Exponential linear unit
FN             False-negative
FNN            Feedforward neural network
FP             False-positive
GD             Gradient descent
GPU            Graphics processing unit
GTSRB          German traffic sign recognition benchmark
HOG            Histogram of oriented gradients
HSI            Hue-saturation-intensity
HSV            Hue-saturation-value
KNN            K-nearest neighbor
Leaky ReLU     Leaky rectified linear unit
LRN            Local response normalization
OVO            One versus one
OVR            One versus rest
PCA            Principal component analysis
PPM            Portable pixel map
PReLU          Parameterized rectified linear unit
ReLU           Rectified linear unit
RMSProp        Root mean square propagation
RNN            Recurrent neural network
RReLU          Randomized rectified linear unit
SGD            Stochastic gradient descent
SNR            Signal-to-noise ratio
SPP            Spatial pyramid pooling
TN             True-negative
TP             True-positive
t-SNE          t-distributed stochastic neighbor embedding
TTC            Time to completion

List of Figures

Fig. 1.1  Common pipeline for recognizing traffic signs  2
Fig. 1.2  Some of the challenges in classification of traffic signs. The signs have been collected in Germany and Belgium  4
Fig. 1.3  Fine differences between two traffic signs  4
Fig. 1.4  Traditional approach for classification of objects  6
Fig. 1.5  Dictionary learnt by Aghdam et al. (2015) from 43 classes of traffic signs  10
Fig. 2.1  A dataset of two-dimensional vectors representing two classes of objects  17
Fig. 2.2  K-nearest neighbor looks for the K closest points in the training set to the query point  18
Fig. 2.3  K-nearest neighbor applied on every point on the plane for different values of K  19
Fig. 2.4  Geometry of linear models  21
Fig. 2.5  The intuition behind the squared loss function is to minimize the squared difference between the actual response and the predicted value. Left and right plots show two lines with different w1 and b. The line in the right plot is fitted better than the line in the left plot since its prediction error is lower in total  23
Fig. 2.6  Status of gradient descent in four different iterations. The parameter vector w changes greatly in the first iterations. However, as it gets closer to the minimum of the squared loss function, it changes slightly  26
Fig. 2.7  The geometrical intuition behind the least square loss function is to minimize the sum of unnormalized distances between the training samples xi and their corresponding hypothetical line  27
Fig. 2.8  The squared loss function may fit inaccurately on training data if there are noisy samples in the dataset  27
Fig. 2.9  The sign function can be accurately approximated using tanh(kx) when k >> 1  29
Fig. 2.10  The sign loss function is able to deal with noisy datasets and the separated clusters problem mentioned previously  30
Fig. 2.11  The derivative of the tanh(kx) function saturates as |x| increases. Also, the rate of saturation grows rapidly when k > 1  30
Fig. 2.12  Hinge loss increases the margin of samples while it is trying to reduce the classification error. Refer to the text for more details  31
Fig. 2.13  Training a linear classifier using the hinge loss function on two different datasets  33
Fig. 2.14  Plot of the sigmoid function (left) and the logarithm of the sigmoid function (right). The domain of the sigmoid function is the real numbers and its range is [0, 1]  35
Fig. 2.15  Logistic regression is able to deal with separated clusters  37
Fig. 2.16  The tanh squared loss and zero-one loss functions are not convex. In contrast, the squared loss, the hinge loss and its variant, and the logistic loss functions are convex  38
Fig. 2.17  Logistic regression tries to reduce the logistic loss even after finding a hyperplane which discriminates the classes perfectly  39
Fig. 2.18  Using the hinge loss function, the magnitude of w changes until all the samples are classified correctly and they do not fall into the critical region  40
Fig. 2.19  When classes are not linearly separable, ||w|| may have an upper bound in the logistic loss function  41
Fig. 2.20  A sample dataset including four different classes. Each class is shown using a unique color and shape  42
Fig. 2.21  Training six classifiers on the four-class classification problem. The one-versus-one technique considers all unordered pairs of classes in the dataset and fits a separate binary classifier on each pair. An input x is classified by computing the majority of votes produced by the binary classifiers. The bottom plot shows the assignment of every point on the plane to one of the four classes  43
Fig. 2.22  The one-versus-rest approach creates a binary dataset by changing the label of the class of interest to 1 and the label of the other classes to -1. Creating binary datasets is repeated for all classes. Then, a binary classifier is trained on each of these datasets. An unseen sample is classified based on the classification scores of the binary classifiers  45
Fig. 2.23  A two-dimensional space divided into four regions using four linear models fitted using the multiclass hinge loss function. The plot on the right shows the linear models (lines in the two-dimensional case) in the space  48
Fig. 2.24  Computational graph of the softmax loss on one sample  51
Fig. 2.25  The two-dimensional space divided into four regions using four linear models fitted using the softmax loss function. The plot on the right shows the linear models (lines in the two-dimensional case) in the space  52
Fig. 2.26  A linear classifier is not able to accurately discriminate the samples in a nonlinear dataset  52
Fig. 2.27  Transforming samples from the original space (left) into another space (right) by applying Φ(x) on each sample. The bottom colormaps show how the original space is transformed using this function  54
Fig. 2.28  Samples become linearly separable in the new space. As a result, a linear classifier is able to accurately discriminate these samples. If we transform the linear model from the new space into the original space, the linear decision boundary becomes a nonlinear boundary  55
Fig. 2.29  43 classes of traffic signs obtained from the GTSRB dataset (Stallkamp et al. 2012)  55
Fig. 2.30  Weights of a linear model trained directly on raw pixel intensities can be visualized by reshaping the vectors so they have the same shape as the input image. Then, each channel of the reshaped matrix can be shown using a colormap  56
Fig. 2.31  Computational graph for (2.78). The gradient of each node with respect to its parent is shown on the edges  60
Fig. 2.32  By minimizing (2.78) the model learns to jointly transform and classify the vectors. The first row shows the distribution of the training samples in the two-dimensional space. The second and third rows show the status of the model in three different iterations starting from the left plots  61
Fig. 2.33  Simplified diagram of a biological neuron  62
Fig. 2.34  Diagram of an artificial neuron  63
Fig. 2.35  A feedforward neural network can be seen as a directed acyclic graph where the inputs are passed through different layers until they reach the end  63
Fig. 2.36  Computational graph corresponding to a feedforward network for classification of three classes. The network accepts two-dimensional inputs and it has two hidden layers. The hidden layers consist of four and three neurons, respectively. Each neuron has two inputs including the weights and inputs from the previous layer. The derivative of each node with respect to each input is shown on the edges  65
Fig. 2.37  Forward mode differentiation starts from the end node to the starting node. At each node, it sums the output edges of the node where the value of each edge is computed by multiplying the edge with the derivative of the child node. Each rectangle with different color and line style shows which part of the partial derivative is computed up to that point  67
Fig. 2.38  A sample computational graph with a loss function. To cut the clutter, activation functions have been fused with the soma function of the neuron. Also, the derivatives on edges are illustrated using small letters. For example, g denotes dH^2_0/dH^1_1  69
Fig. 2.39  Sigmoid activation function and its derivative  72
Fig. 2.40  Tangent hyperbolic activation function and its derivative  73
Fig. 2.41  The softsign activation function and its derivative  74
Fig. 2.42  The rectified linear unit activation function and its derivative  74
Fig. 2.43  The leaky rectified linear unit activation function and its derivative  75
Fig. 2.44  The softplus activation function and its derivative  76
Fig. 2.45  The exponential linear unit activation function and its derivative  77
Fig. 2.46  The softplus activation function and its derivative  78
Fig. 2.47  The weights affect the magnitude of the function for a fixed value of bias and x (left). The bias term shifts the function to the left or right for a fixed value of w and x (right)  78
Fig. 2.48  A deeper network requires fewer neurons to approximate a function  81
Fig. 3.1  Every neuron in a fully connected layer is connected to every pixel in a grayscale image  86
Fig. 3.2  We can hypothetically arrange the neurons in blocks. Here, the neurons in the hidden layer have been arranged into 50 blocks of size 12 x 12  87
Fig. 3.3  Neurons in each block can be connected locally to the input image. In this figure, each neuron is connected to a 5 x 5 region in the image  87
Fig. 3.4  Neurons in one block can share the same set of weights, leading to a reduction in the number of parameters  88
Fig. 3.5  The above convolution layer is composed of 49 filters of size 5. The output of the layer is obtained by convolving each filter on the image  89
Fig. 3.6  Normally, convolution filters in a ConvNet are three-dimensional arrays where the first two dimensions are arbitrary numbers and the third dimension is always equal to the number of output channels in the previous layer  90
Fig. 3.7  From the ConvNet point of view, an RGB image is a three-channel input. The image is taken from www.flickr.com  91
Fig. 3.8  Two layers from the middle of a neural network indicating the one-dimensional convolution. The weight W^2 is shared among the neurons of H^2. Also, δ_i shows the gradient of the loss function with respect to H^2_i  92
Fig. 3.9  A pooling layer reduces the dimensionality of each feature map separately  96
Fig. 3.10  A one-dimensional max-pooling layer where the neurons in H^2 compute the maximum of their inputs  97
Fig. 3.11  Representing LeNet-5 using a DAG  98
Fig. 3.12  Representing AlexNet using a DAG  100
Fig. 3.13  Designing a ConvNet is an iterative process. Finding a good architecture may require several iterations of design-implement-evaluate  101
Fig. 3.14  A dataset is usually partitioned into three different parts, namely the training set, the development set, and the test set  105
Fig. 3.15  For a binary classification problem, the confusion matrix is a 2 x 2 matrix  108
Fig. 3.16  Confusion matrix in multiclass classification problems  109
Fig. 3.17  A linear model is highly biased toward data, meaning that it is not able to model nonlinearities in the data  115
Fig. 3.18  A nonlinear model is less biased but it may model any small nonlinearity in the data  116
Fig. 3.19  A nonlinear model may still overfit on a training set with many samples  116
Fig. 3.20  A neural network with greater weights is capable of modeling sudden changes in the output. The right decision boundary is obtained by multiplying the third layer of the neural network on the left by 10  118
Fig. 3.21  If dropout is activated on a layer, each neuron in the layer will be attached to a blocker. The blocker blocks information flow in the forward pass as well as the backward pass (i.e., backpropagation) with probability p  120
Fig. 3.22  If the learning rate is kept fixed it may jump over the local minimum (left). But annealing the learning rate helps the optimization algorithm to converge to a local minimum  122
Fig. 3.23  Exponential learning rate annealing  122
Fig. 3.24  Inverse learning rate annealing  123
Fig. 3.25  Step learning rate annealing  124
Fig. 4.1  The Caffe library uses different third-party libraries and it provides interfaces for the C++, Python, and MATLAB programming languages  132
Fig. 4.2  The NetParameter is indirectly connected to many other messages in the Caffe library  135
Fig. 4.3  A computational graph (neural network) with three layers  136
Fig. 4.4  Architecture of the network designed by the protobuf text. Dark rectangles show nodes. Octagons illustrate the names of the top elements. The number of outgoing arrows in a node is equal to the length of the top array of the node. Similarly, the number of incoming arrows to a node shows the length of the bottom array of the node. The ellipses show the tops that are not connected to another node  141
Fig. 4.5  Diagram of the network after adding a ReLU activation  144
Fig. 4.6  Architecture of the network after adding a pooling layer  145
Fig. 4.7  Architecture of the network after adding a pooling layer  146
Fig. 4.8  Diagram of the network after adding two fully connected layers and two dropout layers  147
Fig. 4.9  Final architecture of the network. The architecture is similar to the architecture of LeNet-5 in nature. The differences are in the activation functions, the dropout layer, and the connections in the middle layers  150
Fig. 5.1  Some of the challenges in classification of traffic signs. The signs have been collected in Germany and Belgium  168
Fig. 5.2  Sample images from the GTSRB dataset  173
Fig. 5.3  The image in the middle is the flipped version of the image on the left. The image on the right is another sample from the dataset. The Euclidean distance from the left image to the middle image is equal to 25,012.461 and the Euclidean distance from the left image to the right image is equal to 27,639.447  177
Fig. 5.4  The original image on top is modified using Gaussian filtering (first row), motion blur (second and third rows), median filtering (fourth row), and sharpening (fifth row) with different values of parameters  181
Fig. 5.5  Augmenting the sample in Fig. 5.4 using random cropping (first row), hue scaling (second row), value scaling (third row), Gaussian noise (fourth row), Gaussian noise shared between channels (fifth row), and dropout (sixth row) methods with different configurations of parameters  184
Fig. 5.6  Accuracy of the model on the training and validation sets tells us whether or not a model is acceptable or it suffers from high bias or high variance  188
Fig. 5.7  A ConvNet consisting of two convolution-hyperbolic activation-pooling blocks without fully connected layers. Ignoring the activation layers, this network is composed of five layers  190
Fig. 5.8  Training/validation curve of the network illustrated in Fig. 5.7  193
Fig. 5.9  Architecture of the network that won the GTSRB competition (Ciresan et al. 2012a)  194
Fig. 5.10  Training/validation curve of the network illustrated in Fig. 5.9  195
Fig. 5.11  Architecture of the network in Aghdam et al. (2016a) along with visualization of the first fully connected layer as well as the last two pooling layers using the t-SNE method. Light blue, green, yellow and dark blue shapes indicate convolution, activation, pooling, and fully connected layers, respectively. In addition, each purple shape shows a linear transformation function. Each class is shown with a unique color in the scatter plots  196
Fig. 5.12  Training/validation curve of the network illustrated in Fig. 5.11  198
Fig. 5.13  Compact version of the network illustrated in Fig. 5.11 after dropping the first fully connected layer and the subsequent Leaky ReLU layer  199
Fig. 5.14  Incorrectly classified images. The blue and red numbers below each image show the actual and predicted class labels, respectively. The traffic sign corresponding to each class label is illustrated in Table 5.5  209
Fig. 5.15  Sample images from the BTSC dataset  210
Fig. 5.16  Incorrectly classified images from the BTSC dataset. The blue and red numbers below each image show the actual and predicted class labels, respectively. The traffic sign corresponding to each class label is illustrated in Table 5.5  211
Fig. 5.17  The result of fine-tuning the ConvNet on the BTSC dataset that is trained using the GTSRB dataset. The horizontal axis shows the layer n at which the network starts the weight adjustment. In other words, the weights of the layers before layer n are fixed (frozen). The weights of layer n and all layers after layer n are adjusted on the BTSC dataset. We repeated the fine-tuning procedure 4 times for each n in {1, ..., 5}, separately. Red circles show the accuracy of each trial and blue squares illustrate the mean accuracy. The t-SNE visualizations of the best network for n = 3, 4, 5 are also illustrated. The t-SNE visualization is computed on the LReLU4 layer  213
Fig. 5.18  Minimum additive noise which causes the traffic sign to be misclassified by the minimum difference compared with the highest score  216
Fig. 5.19  Plot of the SNRs of the noisy images found by optimizing (5.7). The mean SNR and its variance are illustrated  216
Fig. 5.20  Visualization of the transformation and the first convolution layers  217
Fig. 5.21  Classification score of traffic signs averaged over 20 instances per traffic sign. A warmer color indicates a higher score and a colder color shows a lower score. The corresponding window of element (m, n) in the score matrix is shown for one instance. It should be noted that (m, n) is the top-left corner of the window, not its center, and the size of the window is 20% of the image size in all the results  218
Fig. 5.22  Classification score of traffic signs averaged over 20 instances per traffic sign. A warmer color indicates a higher score. The corresponding window of element (m, n) in the score matrix is shown for one instance. It should be noted that (m, n) is the top-left corner of the window, not its center, and the size of the window is 40% of the image size in all the results  219
Fig. 5.23  Classification score of traffic signs averaged over 20 instances per traffic sign. A warmer color indicates a higher score. The corresponding window of element (m, n) in the score matrix is shown for one instance. It should be noted that (m, n) is the top-left corner of the window, not its center, and the size of the window is 40% of the image size in all the results  220
Fig. 5.24  Receptive field of some neurons in the last pooling layer  221
Fig. 5.25  Average image computed over each of 250 channels using the 100 images with the highest value in position (0, 0) of the last pooling layer. The corresponding receptive field of this position is shown using a cyan rectangle  221
Fig. 5.26  The modified ConvNet architecture compared with Fig. 5.11  223
Fig. 5.27  Relation between the batch size and time-to-completion of the ConvNet  224
Fig. 5.28  Misclassified traffic signs. The blue and red numbers indicate the actual and predicted class labels, respectively  226
Fig. 5.29  Lipschitz constant (top) and the correlation between d(x, x + N(0, σ)) and d(C_fc2(x), C_fc2(x + N(0, σ))) (bottom) computed on 100 samples from every category in the GTSRB dataset. The red circles are the noisy instances that are incorrectly classified. The size of each circle is associated with the value of σ in the Gaussian noise  228
Fig. 5.30  Visualizing the relu4 (left) and the pooling3 (right) layers in the classification ConvNet using the t-SNE method. Each class is shown using a different color  229
Fig. 5.31  Histogram of leaking parameters  230
Fig. 6.1  The detection module must be applied on a high-resolution image  236
Fig. 6.2  The ConvNet for detecting traffic signs. The blue, green, and yellow colors indicate a convolution, LReLU, and pooling layer, respectively. C(c, n, k) denotes n convolution kernels of size k x k x c and P(k, s) denotes a max-pooling layer with pooling size k x k and stride s. Finally, the number in the LReLU units indicates the leak coefficient of the activation function  237
Fig. 6.3  Applying the trained ConvNet for hard-negative mining  238
Fig. 6.4  Implementing the sliding window detector within the ConvNet  240
Fig. 6.5  Architecture of the sliding window ConvNet  241
Fig. 6.6  Detection score computed by applying the fully convolutional sliding network to 5 scales of the high-resolution image  241
Fig. 6.7  Time to completion of the sliding ConvNet for different strides. Left: time to completion per resolution. Right: cumulative time to completion  242
Fig. 6.8  Distribution of traffic signs in different scales computed using the training data  243
Fig. 6.9  Top: precision-recall curve of the detection ConvNet along with models obtained by HOG and LBP features. Bottom: numerical values (%) of precision and recall for the detection ConvNet  244
Fig. 6.10  Output of the detection ConvNet before and after post-processing the bounding boxes. A darker bounding box indicates that it is detected in a lower scale image  245
Fig. 7.1  Visualizing classes of traffic signs by maximizing the classification score on each class. The top-left image corresponds to class 0. The class labels increase from left to right and top to bottom  252
Fig. 7.2  Visualizing class saliency using a random sample from each class. The order of images is similar to Fig. 7.1  253
Fig. 7.3  Visualizing expected class saliency using 100 samples from each class. The order of images is similar to Fig. 7.1  254
Fig. 7.4  Reconstructing a traffic sign using representations of different layers  257

1 Traffic Sign Detection and Recognition

1.1 Introduction
Assume you are driving at a speed of 90 km/h on a one-way road and you are about
to join a new road. Even though there was a "danger: two-way road" sign at the
junction, you did not see it and you keep driving in the opposite lane of the new
road. This is a hazardous situation which may end up in a fatal accident because
the driver assumes he or she is still driving on a one-way road. This is only a simple
example in which failing to detect a traffic sign may cause irreversible consequences.
This danger gets even more serious with inexperienced and senior drivers,
especially on unfamiliar roads.
According to the National Safety Council, medically consulted motor-vehicle injuries
for the first 6 months of 2015 were estimated to be about 2,254,000 [1]. Also, the World
Health Organization reported [2] that there were about 1,250,000 fatalities in 2015
due to car accidents. Moreover, another study shows that human error accounts solely
for 57% of all accidents and is a contributing factor in over 90% of accidents. The
above example is one of the scenarios which may occur because of failure to identify
traffic signs.
Furthermore, self-driving cars are going to be commonly used in the near future. They
must also conform to road rules in order not to endanger other road users.
Likewise, smart cars try to assist human drivers and make driving safer and more
comfortable. An Advanced Driver Assistance System (ADAS) is a crucial component of
these cars. One of the main tasks of this module is to recognize traffic signs. This
helps a human driver be aware of all traffic signs and have a safer driving
experience.

[1] www.nsc.org/NewsDocuments/2015/6-month-fatality-increase.pdf.
[2] www.who.int/violence_injury_prevention/road_safety_status/2015/GSRRS2015_data/en/.


1.2 Challenges
A traffic sign recognition module is composed of two main steps: detection
and classification. This is shown in Fig. 1.1. The detection stage scans the image of the scene
in a multi-scale fashion and looks for the locations of traffic signs in the image. In this
stage, the system usually does not distinguish one traffic sign from another. Instead,
it decides whether or not a region includes a traffic sign, regardless of its type. The
output of the detection stage is a set of regions in the image containing traffic signs. As
shown in the figure, the detection module might make mistakes and generate a few
false-positive detections. In other words, there could be a few regions in the output
of the detection module without any traffic sign. These outputs have been marked using
a red (dashed) rectangle in the figure.
Fig. 1.1 Common pipeline for recognizing traffic signs

Next, the classification module analyzes each region separately and determines the type of
each traffic sign. For example, there is one "no turning to left" sign, one "roundabout"
sign, and one "give way" sign in the figure. There are also three "pedestrian crossing"
signs. Moreover, even though there is no traffic sign inside the false-positive regions,
the classification module labels them as one of the traffic sign classes. In this example,
the false-positive regions have been classified as "speed limit 100" and "no entry"
signs.
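The two-stage pipeline can be summarized in a few lines of code. The sketch below only illustrates the control flow, not the detector developed later in the book; looks_like_sign and classify_sign are hypothetical placeholders standing in for a real detector and classifier.

```python
import numpy as np

def looks_like_sign(patch):
    """Hypothetical detector stub: decides whether a patch may contain a sign."""
    return patch.std() > 0.28           # dummy rule standing in for a trained model

def classify_sign(patch):
    """Hypothetical classifier stub: assigns one of 43 classes (as in GTSRB)."""
    return int(patch.mean() * 1000) % 43

def recognize(image, win=32, stride=16, factors=(1, 2, 4)):
    """Multi-scale sliding-window detection followed by classification."""
    detections = []
    for f in factors:
        scaled = image[::f, ::f]        # simple image pyramid by subsampling
        for y in range(0, scaled.shape[0] - win + 1, stride):
            for x in range(0, scaled.shape[1] - win + 1, stride):
                patch = scaled[y:y + win, x:x + win]
                if looks_like_sign(patch):           # detection stage
                    label = classify_sign(patch)     # classification stage
                    detections.append((x * f, y * f, win * f, label))
    return detections

# Toy run on a random grayscale image; a real system would load a road-scene image.
print(recognize(np.random.rand(256, 256)))
```

In the book, both stages are eventually implemented with ConvNets, and Chap. 6 shows how the sliding window itself can be folded into the network.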
Dealing with false-positive regions generated by the detection module is one of the major
challenges in developing a practical traffic sign recognition system. For instance, a
self-driving car may suddenly brake in the above hypothetical example because it has
detected a no-entry sign. Consequently, one of the practical challenges in developing
a detection module is to have zero false-positive regions. Also, it has to detect all traffic
signs in the image. Technically, its true-positive rate must be 100%. Satisfying these
two criteria is not trivial in practical applications.
There are two major goals in designing traffic signs. First, they must be easily
distinguishable from the rest of the objects in the scene and, second, their meaning must be
easily perceivable and independent of spoken language. To this end, traffic signs
are designed with simple geometrical shapes such as triangles, circles, rectangles, or
polygons. To be easily detectable among the rest of the objects, traffic signs are painted using
basic colors such as red, blue, yellow, black, and white. Finally, the meaning of
traffic signs is mainly conveyed by the pictographs in the center of the signs. It should
be noted that some signs heavily depend on text-based information. However, we
can still think of the text in traffic signs as pictographs.
Although classification of traffic signs is an easy task for a human, there are some
challenges in developing an algorithm for this purpose. Some of these challenges are
illustrated in Fig. 1.2. First, images of traffic signs might be captured from different
perspectives. This may nonlinearly deform the shape of traffic signs.
Second, weather conditions can dramatically affect the appearance of traffic signs.
An example is illustrated in Fig. 1.2 where the "no stopping" sign is covered by snow.
Third, traffic signs wear out over time and some artifacts may appear on
signs which might have a negative impact on their classification. Fourth, traffic signs
might be partially occluded by other signs or objects. Fifth, the pictograph area
might be manipulated by humans, which in some cases might change the shape of the
pictograph. Another important challenge is the illumination variation caused by weather
conditions or daylight changes. The last and most important issue shown in this figure
is the difference in pictographs of the same traffic sign from one country to another. More
specifically, we observe that the "danger: bicycle crossing" sign possesses important
differences between the two countries.
Referring to the Vienna Convention on Road Traffic Signs, we can find roughly
230 pictorial traffic signs. Here, text-based signs and variations on pictorial signs
are counted. For example, the speed limit sign can have 24 variations, including 12
variations for indicating speed limits and 12 variations for the end of speed limits. Likewise,
traffic signs such as recommended speed, minimum speed, minimum distance to the
front car, etc. may have several variations. Hence, traffic sign recognition is a large
multi-class classification problem. This makes the problem even more challenging.
Note that some signs such as the "crossroad" and "side road" signs differ only
in very fine details. This is shown in Fig. 1.3, where both signs differ only in a small
part of the pictograph. Looking at Fig. 1.1, we see that signs which are only 30 m away from
the camera occupy a very small region in the image. Sometimes, these regions can
be as small as 20 × 20 pixels. For this reason, identifying fine details becomes very
difficult on these signs.

Fig. 1.2 Some of the challenges in classification of traffic signs. The signs have been collected in Germany and Belgium
Fig. 1.3 Fine differences between two traffic signs
In sum, traffic sign classification is a specific case of object classification where
the objects are more rigid and two-dimensional. Also, their discriminating parts are
well defined. However, there are many challenges in developing a practical system
for detection and classification of traffic signs.

1.3 Previous Work
1.3.1 Template Matching
Arguably, the most trivial way to recognize objects is template matching. In this
method, a set of templates is stored in the system. Given a new image, the template is
matched against every location in the image and a score is computed for each location.
The score might be computed using cross-correlation, the sum of squared differences,
normalized cross-correlation, or the normalized sum of squared differences. Piccioli
et al. (1996) stored a set of traffic signs as templates. Then, the above approach
was used in order to classify the input image. Note that the template-matching approach
can be used for both detection and classification.
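As a minimal sketch of the idea (not the method of Piccioli et al.), the following code scores one location with normalized cross-correlation and slides the template over the image to keep the best-scoring window; the function and variable names are illustrative.

```python
import numpy as np

def ncc(patch, template):
    """Normalized cross-correlation between an image patch and a template."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum()) + 1e-12
    return float((p * t).sum() / denom)

def match_template(image, template, stride=1):
    """Return the best score and the top-left corner of the best-matching window."""
    th, tw = template.shape
    best_score, best_pos = -1.0, (0, 0)
    for y in range(0, image.shape[0] - th + 1, stride):
        for x in range(0, image.shape[1] - tw + 1, stride):
            score = ncc(image[y:y + th, x:x + tw], template)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_score, best_pos

# Toy example: the template is an actual crop of the image, so the best score is ~1.
image = np.random.rand(64, 64)
template = image[20:36, 20:36].copy()
print(match_template(image, template))   # highest score near position (20, 20)
```

A classifier based on this scheme simply repeats the search with the template of every traffic sign class and picks the class whose template yields the highest score.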
In practice, there are many problems with this approach. First, it is sensitive to
perspective, illumination, and deformation. Second, it is not able to deal with
low-quality signs. Third, it might need a large dataset of templates to cover the various kinds
of samples for each traffic sign. For this reason, selecting appropriate templates is a
tedious task.
On the one hand, template matching compares raw pixel intensities between the
template and the source image. On the other hand, pixel intensities greatly depend
on perspective, illumination, and deformation. As a result, a slight change in illumination may affect the matching score significantly. To tackle this problem,
we usually apply some algorithms to the image in order to extract more useful information from it. In other words, in the case of grayscale images, a feature extraction
algorithm accepts a W × H image and transforms the W × H-dimensional vector
into a D-dimensional vector which carries more useful information about the image
and is more tolerant to perspective changes, illumination, and deformation. Based on this idea, Gao et al. (2006) extracted shape features
from both the template and the source image and matched these feature vectors instead
of the raw pixel intensity values. In this work, matching features was done using the
Euclidean distance function. This is equivalent to the sum of squared differences
function. The main problem with this matching function was that every feature was
equally important. To cope with this problem, Ruta et al. (2010) learned a similarity
measure for matching the query sign with templates.

1.3.2 Hand-Crafted Features
The template matching procedure can be decomposed into two steps. In the first step,
a template and an image patch are represented using more informative vectors called
feature vectors. In the second step, feature vectors are compared in order to find the class
of the image patch. This approach is illustrated in Fig. 1.4. Traditionally, the second
step is done using techniques of machine learning. We will explain the basics of
this step in Sect. 2.1. Roughly speaking, extracting a feature vector from an
image can be done using hand-crafted or automatic methods.


Fig. 1.4 Traditional
approach for classification of
objects

Hand-crafted methods are commonly designed by a human expert. They may
apply a series of transformations and computations in order to build a feature vector.
For example, Paclík et al. (2000) generated a binary image depending on the color of the
traffic sign. Then, moment invariant features were extracted from the binary image
to form the feature vector. This method could be very sensitive to noise since a clean
image and its degraded version may have two different binary images. Consequently,
the moments of the binary images might vary significantly. Maldonado-Bascon et al.
(2007) transformed the image into the HSI color space and calculated the histogram of
the Hue and Saturation components. Although this feature vector can distinguish the general
category of traffic signs (for example, mandatory vs. danger signs), it might act
poorly in modeling traffic signs of the same category. This is due to the fact that
traffic signs of the same category have the same color and shape. For instance, all
danger signs are triangles with a red margin. Therefore, the only difference would be
the pictograph of the signs. Since all pictographs are black, they will fall into the same
bin of this histogram. As a result, theoretically, this bin will be the main source of
information for classifying signs of the same category.
In another method, Maldonado Bascón et al. (2010) classified traffic signs using
only the pictograph of each sign. To this end, they first segment the pictograph from
the image of the traffic sign. Although the region of the pictograph is binary, accurate segmentation of a pictograph is not a trivial task since automatic thresholding methods
such as Otsu might fail given the illumination variation and unexpected
noise in real-world applications. For this reason, Maldonado Bascón et al. (2010)
trained an SVM whose input is a 31 × 31 block of pixels from a grayscale version of
the pictograph. In a more complicated approach, Baró et al. (2009) proposed an Error
Correcting Output Code framework for classification of 31 traffic signs and compared
their method with various approaches.
Zaklouta et al. (2011), Zaklouta and Stanciulescu (2012), and Zaklouta and
Stanciulescu (2014) utilized a more sophisticated feature extraction algorithm called
Histogram of Oriented Gradients (HOG). Broadly speaking, the first step in extracting
HOG features is to compute the gradients of the image in the x and y directions. Then, the
image is divided into non-overlapping regions called cells. A histogram is computed
for each cell. Bins of the histogram show the orientation of the gradient vector. The value
of each bin is computed by accumulating the gradient magnitudes of the pixels in
the cell. Next, blocks are formed using neighboring cells. Blocks may overlap
with each other. The histogram of a block is obtained by concatenating the histograms of
the cells within the block. Finally, the histogram of each block is normalized and the final
feature vector is obtained by concatenating the histograms of all blocks.
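The procedure above is implemented in common image processing libraries. The following minimal sketch extracts HOG features using scikit-image; the input patch is random data standing in for a cropped sign, and the hyperparameter values are illustrative choices, not the ones used in the cited works.

```python
import numpy as np
from skimage.feature import hog

# A random 64x64 grayscale "image" stands in for a cropped traffic-sign patch.
patch = np.random.rand(64, 64)

# orientations               -> number of bins in each cell histogram
# pixels_per_cell            -> size of each cell
# cells_per_block, block_norm -> block size and normalization type
features = hog(patch,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               block_norm='L2-Hys',
               feature_vector=True)

print(features.shape)  # the length depends on the chosen hyperparameters
```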
This method is parameterized by the size of each cell, the size of each block, the number of
bins in the cell histograms, and the type of normalization. These parameters are called hyperparameters.


Depending on the value of these hyperparameters, we can obtain different feature vectors with different lengths on the same image. HOG is known to be a
powerful hand-crafted feature extraction algorithm. However, objects might not be
linearly separable in the feature space. For this reason, Zaklouta and Stanciulescu
(2014) trained a Random Forest and an SVM for classifying traffic signs using HOG
features. Likewise, Greenhalgh and Mirmehdi (2012), Moiseev et al. (2013), Mathias
et al. (2013), Huang et al. (2013), and Sun et al. (2014) extracted HOG features.
The difference between these works mainly lies in their classification model (e.g.,
SVM, Cascade SVM, Extreme Learning Machine, Nearest Neighbor, and LDA).
However, in contrast to the other works, Huang et al. (2013) used a two-level classification model. In the first level, the image is classified into one of the super-classes.
Each super-class contains several traffic signs with similar shape/color. Then, the
perspective of the input image is adjusted based on its super-class and another classification model is applied on the adjusted image. The main problem of this method
is the sensitivity of the final classification to the adjustment procedure.
Mathias et al. (2013) proposed a more complicated procedure for extracting features. Specifically, they first extracted HOG features with several configurations of
hyperparameters. In addition, they extracted more feature vectors using different
methods. Finally, they concatenated all these vectors and built the final feature vector. Notwithstanding, there are a few problems with this method. Their feature vector
is a 9000-dimensional vector constructed by applying five different methods. This
high-dimensional vector is later projected onto a lower dimensional space using a
transformation matrix.

1.3.3 Feature Learning
A hand-crafted feature extraction method is developed by an expert and applies a
series of transformations and computations in order to extract the final vector. The
choice of these steps completely depends on the expert. One problem with hand-crafted features is their limited representation power. This causes some classes
of objects to overlap with other classes, which adversely affects the classification performance. Two common approaches for partially alleviating this problem are to develop
a new feature extraction algorithm and to combine various methods. The problems
with these approaches are that devising a new hand-crafted feature extraction method
is not trivial and combining different methods might not separate the overlapping
classes.
The basic idea behind feature learning is to learn features from data. To be more
specific, given a dataset of traffic signs, we want to learn a mapping M : Rd → Rn
which accepts d = W × H -dimensional vectors and returns an n-dimensional vector.
Here, the input is a flattened image that is obtained by putting the rows of image
next to each other and creating a one-dimensional array. The mapping M can be
any arbitrary function that is linear or nonlinear. In the simplest scenario, M can be


a linear function such as

$$\mathcal{M}(x) = W^{+}(x^{T} - \bar{x}), \qquad (1.1)$$

where W ∈ R^{d×n} is a weight matrix, x ∈ R^d is the flattened image, and x̄ ∈ R^d is
the flattened mean image. Moreover, W^+ = (W^T W)^{-1} W^T denotes the Moore-Penrose pseudoinverse of W. Given the matrix W we can map every image into an
n-dimensional space using this linear transformation. Now, the question is how to
find the values of W ?
In order to obtain W, we must devise an objective and try to get as close as possible
to the objective by changing the values of W. For example, assume our objective is
to project x into a five-dimensional space where the projection is done arbitrarily.
It is clear that any W ∈ R^{d×5} will serve this purpose. Denoting M(x) with z, our
aim might instead be to project the data onto an n ≤ d dimensional space while maximizing the variance of z.
Finding the W that maximizes this objective is known as principal component
analysis. Bishop (2006) has explained that to find the W that maximizes this objective
function, we must compute the covariance matrix of the data and find the eigenvectors and
eigenvalues of the covariance matrix. Then, the eigenvectors are sorted according
to their eigenvalues in descending order and the first n eigenvectors are picked to
form W.
Now, given any W × H image, we plug it into (1.1) to compute z. Then, the n-dimensional vector z is used as the feature vector. This method was previously used
by Sirovich and Kirby (1987) for modeling human faces. Fleyeh and Davami (2011)
also projected the image into the principal component space and found the class of the
image by computing the Euclidean distance of the projected image with the images
in the database.
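A minimal NumPy sketch of this procedure is shown below. It builds W from the eigenvectors of the covariance matrix and projects a flattened image as in (1.1). The variable names and the random data are our own, and we use the fact that for orthonormal eigenvectors the pseudoinverse W^+ reduces to W^T.

```python
import numpy as np

def learn_pca_basis(X, n):
    """X: N x d matrix of flattened training images (one image per row)."""
    x_bar = X.mean(axis=0)                      # flattened mean image
    C = np.cov(X - x_bar, rowvar=False)         # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)        # eigh: suitable for symmetric matrices
    order = np.argsort(eigvals)[::-1]           # sort eigenvectors by eigenvalue, descending
    W = eigvecs[:, order[:n]]                   # d x n matrix of the top-n eigenvectors
    return W, x_bar

def project(x, W, x_bar):
    """Map a flattened image x to the n-dimensional feature vector z."""
    return W.T @ (x - x_bar)   # W has orthonormal columns, so W^+ = W^T

# Usage on random data standing in for 100 flattened 32x32 images.
X = np.random.rand(100, 32 * 32)
W, x_bar = learn_pca_basis(X, n=50)
z = project(X[0], W, x_bar)
print(z.shape)   # (50,)
```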
If we multiply both sides with W and rearrange (1.1), we will obtain

$$x^{T} = W z + \bar{x}^{T}. \qquad (1.2)$$

Assume that x̄^T = 0. Technically, we say our data is zero-centered. According to
this equation, we can reconstruct x using W and its mapping z. Each column in W
is a d-dimensional vector which can be seen as a template learnt from the data. With
this intuition, the first row in W shows the set of values of the first pixel in our dictionary
of templates. Likewise, the ith row in W is the set of values of the ith pixel in the templates.
Consequently, the vector z shows how to linearly combine these templates in order to
reconstruct the original image. As the value of n increases, the reconstruction error
decreases.
The value of W depends directly on the data that we have used during the training
stage. In other words, using the training data, a system learns to extract features.
However, we do not take into account the class of objects in finding W. In general,
methods that do not consider the class of objects are called unsupervised methods.
One limitation of principal component analysis is that n ≤ d. Also, z is a real vector
which is likely to be non-sparse. We can simplify (1.2) by omitting the second term.


Now, our objective is to find W and z by minimizing the constrained reconstruction
error:

$$E = \sum_{i=1}^{N} \left\| x_i^{T} - W z_i \right\|_2^2 \quad \text{s.t.} \quad \|z_i\|_1 < \mu, \qquad (1.3)$$

where µ is a user-defined value and N is the number of training images. W and
z_i have the same meaning as mentioned above. The L1 constraint in the
above equation forces z_i to be sparse. A vector is called sparse when most of its
elements are zero. Minimizing the above objective function requires an alternating
optimization of W and z_i. This method is called sparse coding. Interested readers
can find more details in Mairal et al. (2014). It should be noted that there are other
formulations for the objective function and the constraint.
There are two advantages of the sparse coding approach compared with principal component analysis. First, the number of columns in W (i.e., n) is not restricted
to be smaller than d. Second, z_i is a sparse vector. Sparse coding has also been used
to encode images of traffic signs.
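As an illustration, the sketch below learns a dictionary and sparse codes with scikit-learn's DictionaryLearning. Note that scikit-learn minimizes a penalized (Lagrangian) form of (1.3), with a term α‖z‖₁ instead of a hard constraint; the data and the parameter values here are placeholders chosen only to keep the example small.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# 200 flattened 16x16 patches stand in for traffic-sign data.
X = np.random.rand(200, 16 * 16)

# scikit-learn optimizes ||x - z D||^2 + alpha * ||z||_1, which plays the same
# role as the hard L1 constraint in (1.3).
dico = DictionaryLearning(n_components=300,  # n may exceed d, unlike PCA
                          alpha=1.0,          # controls the sparsity of z
                          max_iter=20)
Z = dico.fit_transform(X)      # sparse codes, one row per sample
D = dico.components_           # learnt dictionary (rows are atoms)

print(Z.shape, D.shape)
print('non-zero coefficients per sample:', np.count_nonzero(Z, axis=1).mean())
```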
Hsu and Huang (2001) coded each traffic sign using the Matching Pursuit algorithm. During testing, the input image is projected onto different sets of filter bases to
find the best match. Lu et al. (2012) proposed a graph embedding approach for classifying traffic signs. They preserved the sparse representation in the original space
using the L1,2 norm. Liu et al. (2014) constructed the dictionary by applying k-means
clustering on the training data. Then, each sample is coded using a novel coding method
similar to the Local Linear Coding approach (Wang et al. 2010). Moreover, Aghdam et al.
(2015) proposed a method based on visual attributes and a Bayesian network. In this
method, each traffic sign is described in terms of visual attributes. In order to detect
visual attributes, the input image is divided into several regions and each region is
coded using the elastic net sparse coding method. Finally, attributes are detected using a
random forest classifier. The detected attributes are further refined using a Bayesian
network. Figure 1.5 illustrates a dictionary learnt by Aghdam et al. (2015) from 43
classes of traffic signs.
There are other unsupervised feature learning techniques. Among them, autoencoders, deep belief networks, and independent component analysis have been extensively studied and used in the computer vision community. One of the major disadvantages of unsupervised feature learning methods is that they do not consider the
class of objects during the learning process. More accurate results have been obtained
using supervised feature learning methods. As we will discuss in Chap. 3, convolutional neural networks (ConvNets) have shown great success in classification and
detection of objects.


Fig. 1.5 Dictionary learnt by Aghdam et al. (2015) from 43 classes of traffic signs

1.3.4 ConvNets
ConvNets³ were first utilized by Sermanet and Lecun (2011) and Ciresan et al. (2012)
in the field of traffic sign classification during the German Traffic Sign Recognition
Benchmark (GTSRB) competition, where the ensemble of ConvNets designed by
Ciresan et al. (2012) surpassed human performance and won the competition by
correctly classifying 99.46% of test images. Moreover, the ConvNet of Sermanet
and Lecun (2011) ended up in second place with a considerable difference compared with the third place, which was awarded to a method based on the traditional
classification approach. The classification accuracies of the runner-up and the third
place were 98.97 and 97.88%, respectively.
Ciresan et al. (2012) constructed an ensemble of 25 ConvNets, each consisting of
1,543,443 parameters. Sermanet and Lecun (2011) created a single network defined
by 1,437,791 parameters. Furthermore, while the winner ConvNet uses the hyperbolic activation function, the runner-up ConvNet utilizes the rectified sigmoid as the
activation function. It is a common practice in ConvNets to make a prediction by
calculating the average score of slightly transformed versions of the query image.

³ We shall explain all technical details of this section in the rest of this book.


However, it is not clearly mentioned in Sermanet and Lecun (2011) how they
make a prediction. In particular, it is not clear whether the runner-up ConvNet classifies
solely the input image or whether it classifies different versions of the input and fuses the
scores to obtain the final result.
Regardless, both methods suffer from a high number of arithmetic operations.
To be more specific, they use computationally expensive activation functions. To alleviate
these problems, Jin et al. (2014) proposed a new architecture including 1,162,284
parameters and utilizing the rectified linear unit (ReLU) activation (Krizhevsky et al.
2012). In addition, there is a Local Response Normalization layer after each activation
layer. They built an ensemble of 20 ConvNets and classified 99.65% of test images
correctly. Although the number of parameters is reduced using this architecture
compared with the two networks, the ensemble is constructed using 20 ConvNets,
which is still not computationally efficient in real-world applications. It is worth
mentioning that a ReLU layer and a Local Response Normalization layer together
need approximately the same number of arithmetic operations as a single hyperbolic
layer. As a result, the run-time efficiency of the network proposed in Jin et al. (2014)
might be close to Ciresan et al. (2012).
Recently, Zeng et al. (2015) trained a ConvNet to extract features of the image,
replaced the classification layer of their ConvNet with an Extreme Learning Machine
(ELM), and achieved 99.40% accuracy on the GTSRB dataset. There are two issues
with their approach. First, the output of the last convolution layer is a 200-dimensional
vector which is connected to 12,000 neurons in the ELM layer. This layer alone is
defined by 200 × 12,000 + 12,000 × 43 = 2,916,000 parameters which makes it
impractical. Besides, it is not clear why their ConvNet reduces the dimension of the
feature vector from 250 × 16 = 4000 in Layer 7 to 200 in Layer 8 and then maps
this lower dimensional vector to 12,000 dimensions in the ELM layer (Zeng et al.
2015, Table 1). One reason might be to cope with the calculation of the matrix inverse
during training of the ELM layer. Finally, since the input connections of the ELM
layer are determined randomly, it is probable that their ConvNet does not generalize
well on other datasets.
The common point about all the above ConvNets is that they are only suitable for
the classification module and they cannot be directly used in the task of detection.
This is due to the fact that applying these ConvNets on high-resolution images is not
computationally feasible. On the other hand, the accuracy of the classification module
also depends on the detection module. In other words, any false-positive result
produced by the detection module will be entered into the classification module
and it will be classified as one of the traffic signs. Ideally, the false-positive rate of the
detection module must be zero and its true-positive rate must be 1. Achieving this
goal usually requires more complex image representation and classification models.
However, as the complexity of these models increases, the detection module needs
more time to complete its task.
The ConvNets proposed for traffic sign classification can be examined from three
perspectives: scalability, stability, and run-time. From a generalization point
of view, none of the four ConvNets has assessed the performance on other datasets. It
is crucial to study how the networks perform when the signs slightly change from one


country to another. More importantly, the transferring power of the network must be
estimated by fine-tuning the same architecture on a new dataset with various numbers
of classes. In this way, we are able to estimate the scalability of the networks. From a
stability perspective, it is crucial to find out how tolerant the network is against noise
and occlusion. This might be done by generating a few noisy images and feeding
them to the network. However, this approach does not find the minimum noisy image
which is misclassified by the network. Finally, the run-time efficiency of the ConvNet
must be examined. This is due to the fact that the ConvNet has to consume as few
CPU cycles as possible to let other functions of the ADAS perform in real time.

1.4 Summary
In this chapter, we formulated the problem of traffic sign recognition in two stages,
namely detection and classification. The detection stage is responsible for locating
regions of the image containing traffic signs and the classification stage is responsible
for finding the class of the traffic signs. Related work in the field of traffic sign detection and
classification was also reviewed. We mentioned several methods based on hand-crafted
features and then introduced the idea behind feature learning. Finally, we explained
some of the works based on convolutional neural networks.

References
Aghdam HH, Heravi EJ, Puig D (2015) A unified framework for coarse-to-fine recognition of traffic
signs using Bayesian network and visual attributes. In: Proceedings of the 10th international conference on computer vision theory and applications, pp 87–96. doi:10.5220/0005303500870096
Baró X, Escalera S, Vitrià J, Pujol O, Radeva P (2009) Traffic sign recognition using evolutionary
adaboost detection and forest-ECOC classification. IEEE Trans Intell Transp Syst 10(1):113–126.
doi:10.1109/TITS.2008.2011702
Bishop CM (2006) Pattern recognition and machine learning. Information science and statistics.
Springer, New York
Ciresan D, Meier U, Schmidhuber J (2012) Multi-column deep neural networks for image classification. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE, pp 3642–3649.
doi:10.1109/CVPR.2012.6248110, arXiv:1202.2745v1
Fleyeh H, Davami E (2011) Eigen-based traffic sign recognition. IET Intell Transp Syst 5(3):190.
doi:10.1049/iet-its.2010.0159
Gao XW, Podladchikova L, Shaposhnikov D, Hong K, Shevtsova N (2006) Recognition of traffic
signs based on their colour and shape features extracted using human vision models. J Visual
Commun Image Represent 17(4):675–685. doi:10.1016/j.jvcir.2005.10.003
Greenhalgh J, Mirmehdi M (2012) Real-time detection and recognition of road traffic signs. IEEE
Trans Intell Transp Syst 13(4):1498–1506. doi:10.1109/TITS.2012.2208909


Hsu SH, Huang CL (2001) Road sign detection and recognition using matching pursuit method.
Image Vis Comput 19(3):119–129. doi:10.1016/S0262-8856(00)00050-0
Huang GB, Mao KZ, Siew CK, Huang DS (2013) A hierarchical method for traffic sign classification
with support vector machines. In: The 2013 international joint conference on neural networks
(IJCNN). IEEE, pp 1–6. doi:10.1109/IJCNN.2013.6706803
Jin J, Fu K, Zhang C (2014) Traffic sign recognition with hinge loss trained convolutional neural
networks. IEEE Trans Intell Transp Syst 15(5):1991–2000. doi:10.1109/TITS.2014.2308281
Krizhevsky A, Sutskever I, Hinton G (2012) Imagenet classification with deep convolutional neural
networks. In: Advances in neural information processing systems. Curran Associates, Inc., pp
1097–1105
Liu H, Liu Y, Sun F (2014) Traffic sign recognition using group sparse coding. Inf Sci 266:75–89.
doi:10.1016/j.ins.2014.01.010
Lu K, Ding Z, Ge S (2012) Sparse-representation-based graph embedding for traffic sign recognition.
IEEE Trans Intell Transp Syst 13(4):1515–1524. doi:10.1109/TITS.2012.2220965
Mairal J, Bach F, Ponce J (2014) Sparse modeling for image and vision processing. Found Trends
Comput Graph Vis 8(2–3):85–283. doi:10.1561/0600000058
Maldonado Bascón S, Acevedo Rodríguez J, Lafuente Arroyo S, Fernández Caballero A, López-Ferreras F (2010) An optimization on pictogram identification for the road-sign recognition task
using SVMs. Comput Vis Image Underst 114(3):373–383. doi:10.1016/j.cviu.2009.12.002
Maldonado-Bascon S, Lafuente-Arroyo S, Gil-Jimenez P, Gomez-Moreno H, Lopez-Ferreras F
(2007) Road-sign detection and recognition based on support vector machines. IEEE Trans Intell
Transp Syst 8(2):264–278. doi:10.1109/TITS.2007.895311
Mathias M, Timofte R, Benenson R, Van Gool L (2013) Traffic sign recognition - How far are
we from the solution? In: Proceedings of the international joint conference on neural networks.
doi:10.1109/IJCNN.2013.6707049
Moiseev B, Konev A, Chigorin A, Konushin A (2013) Evaluation of traffic sign recognition methods
trained on synthetically generated data. In: 15th international conference on advanced concepts
for intelligent vision systems (ACIVS). Springer, Poznań, pp 576–583
Paclík P, Novovičová J, Pudil P, Somol P (2000) Road sign classification using Laplace kernel
classifier. Pattern Recognit Lett 21(13–14):1165–1173. doi:10.1016/S0167-8655(00)00078-7
Piccioli G, De Micheli E, Parodi P, Campani M (1996) Robust method for road sign detection and
recognition. Image Vis Comput 14(3):209–223. doi:10.1016/0262-8856(95)01057-2
Ruta A, Li Y, Liu X (2010) Robust class similarity measure for traffic sign recognition. IEEE Trans
Intell Transp Syst 11(4):846–855. doi:10.1109/TITS.2010.2051427
Sermanet P, Lecun Y (2011) Traffic sign recognition with multi-scale convolutional networks. In:
Proceedings of the international joint conference on neural networks, pp 2809–2813. doi:10.
1109/IJCNN.2011.6033589
Sirovich L, Kirby M (1987) Low-dimensional procedure for the characterization of human faces.
J Opt Soc Am A 4(3):519–524. doi:10.1364/JOSAA.4.000519, http://josaa.osa.org/abstract.cfm?
URI=josaa-4-3-519
Sun ZL, Wang H, Lau WS, Seet G, Wang D (2014) Application of BW-ELM model on traffic sign
recognition. Neurocomputing 128:153–159. doi:10.1016/j.neucom.2012.11.057
Wang J, Yang J, Yu K, Lv F, Huang T, Gong Y (2010) Locality-constrained linear coding for
image classification. In: 2010 IEEE computer society conference on computer vision and pattern
recognition. IEEE, pp 3360–3367. doi:10.1109/CVPR.2010.5540018
Zaklouta F, Stanciulescu B (2012) Real-time traffic-sign recognition using tree classifiers. IEEE
Trans Intell Transp Syst 13(4):1507–1514. doi:10.1109/TITS.2012.2225618
Zaklouta F, Stanciulescu B (2014) Real-time traffic sign recognition in three stages. Robot Auton
Syst 62(1):16–24. doi:10.1016/j.robot.2012.07.019


Zaklouta F, Stanciulescu B, Hamdoun O (2011) Traffic sign classification using K-d trees and
random forests. In: Proceedings of the international joint conference on neural networks, pp
2151–2155. doi:10.1109/IJCNN.2011.6033494
Zeng Y, Xu X, Fang Y, Zhao K (2015) Traffic sign recognition using deep convolutional networks
and extreme learning machine. In: Intelligence science and big data engineering. Image and video
data engineering (IScIDE). Springer, pp 272–280

2 Pattern Classification

Machine learning problems can be broadly classified into supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, we have a set of
feature vectors and their corresponding target values. The aim of supervised learning
is to learn a model that accurately predicts targets given unseen feature vectors. In other
words, the computer must learn a mapping from feature vectors to target values.
The feature vectors might be called the independent variable and the target values
the dependent variable. Learning is done using an objective function which
directly depends on the target values. For example, classification of traffic signs is a
supervised learning problem.
In the unsupervised setting, we only have a set of feature vectors without any target
value. The main goal of unsupervised learning is to learn the structure of the data. Here,
because target values do not exist, there is no specific way to evaluate the learnt
models. For instance, assume we have a dataset with 10,000 records in which each
record is a vector consisting of [driver's age, driver's gender, driver's education level,
driving experience, type of car, model of car, car manufacturer, GPS point of accident,
temperature, humidity, weather condition, daylight, time, day of week, type of road].
The goal might be to divide this dataset into 20 categories. Then, we can analyze the
categories to see how many records fall into each category and what is common
among these records. Using this information, we might be able to say in which
conditions car accidents happen more frequently. As we can see in this example,
there is no clear way to tell how well the records are categorized.
Reinforcement learning usually happens in dynamic environments where a series of
actions leads the system to a point of getting a reward or punishment. For example,
consider a system that is learning to drive a car. The system starts to drive and
several seconds later it hits an obstacle. A series of actions has caused the system to
hit the obstacle. Notwithstanding, there is no information to tell us how good the
action performed by the system at a specific time was. Instead, the system is


punished because it hit the obstacle. Now, the system must figure out which actions
were not correct and act accordingly.

2.1 Formulation
Supervised learning mainly breaks down into classification and regression. The
main difference between them is the type of target values. While target values of a
regression problem are real/discrete numbers, target values of a classification problem are categorical values which are called labels. To be more specific, assume
F_r : R^d → R is a regression model which returns a real number. Moreover, assume
we have the pair (x_r, y_r) including a d-dimensional input vector x_r and a real number
y_r. Ideally, F_r(x_r) must be equal to y_r. In other words, we can evaluate the accuracy
of the prediction by simply computing |F_r(x_r) − y_r|.
In contrast, assume the classification model

$$F_c : \mathbb{R}^d \rightarrow \{\text{speed limit}, \text{danger}, \text{prohibitive}, \text{mandatory}\} \qquad (2.1)$$

which returns a categorical label. Given the pair (x_c, danger), F_c(x_c) must
ideally be equal to danger. However, it might wrongly return mandatory. It is not
possible to simply subtract the output of F_c from the actual label to ascertain how
much the model has deviated from the actual output. The reason is that there is no
specific definition of distance between labels. For example, we cannot tell what the
distance is between "danger" and "prohibitive" or "danger" and "mandatory". In
other words, the label space is not an ordered set. Both traffic sign detection and
recognition problems are formulated using a classification model. In the rest of this
section, we will explain the fundamental concepts using simple examples.
Assume a set of pairs X = {(x_0, y_0), ..., (x_n, y_n)} where x_i ∈ R² is a two-dimensional input vector and y_i ∈ {0, 1} is its label. Despite the fact that 0 and
1 are numbers, we treat them as categorical labels. Therefore, it is not possible to
compute their distance. The target value y_i in this example can only take one of
two values. Classification problems in which the target value can only
take two values are called binary classification problems. In addition, because the
input vectors are two-dimensional, we can easily plot them. Figure 2.1 illustrates the
scatter plot of a sample X.
The blue squares show the points belonging to one class and the pink circles depict
the points belonging to the other class. We observe that the two classes overlap inside
the green polygon. In addition, the vectors shown by the green arrows are likely to be
noisy data. More importantly, these two classes are not linearly separable. In other
words, it is not possible to perfectly separate these two classes from each other by
drawing a line on the plane.
Assume we are given an x_q ∈ R² and we are asked to tell which class x_q belongs
to. This point is shown using a black arrow in the figure. Note that we do not know
the target value of x_q. To answer this question, we first need to learn a model from


Fig. 2.1 A dataset of
two-dimensional vectors
representing two classes of
objects

X which is able to discriminate the two classes. There are many ways to achieve
this goal in the literature. However, we are only interested in a particular technique
called linear models. Before explaining this technique, we mention a method called
k-nearest neighbor.

2.1.1 K-Nearest Neighbor
From one perspective, machine learning models can be categorized into parametric and nonparametric models. Roughly speaking, parametric models have some
parameters which are directly learnt from data. In contrast, nonparametric models
do not have any parameters to be learnt from data. K-nearest neighbor (KNN) is a
nonparametric method which can be used in regression and classification problems.
Given the training set X, KNN stores all these samples in memory. Then, given
the query vector x_q, it finds the K closest samples from X to x_q.¹ Denoting the K
closest neighbors of x_q with N_K(x_q; X),² the class of x_q is determined by:
$$F(x_q) = \arg\max_{v \in \{0,1\}} \sum_{p \in N_K(x_q)} \delta(v, f(p)) \qquad (2.2)$$

¹ Implementations of the methods in this chapter are available at github.com/pcnn/.
² You can read this formula as "N_K of x_q given the dataset X".


Fig. 2.2 K-nearest neighbor looks for the K closest points in the training set to the query point


$$\delta(a, b) = \begin{cases} 1 & a = b \\ 0 & a \neq b \end{cases} \qquad (2.3)$$

where f(p) returns the label of the training sample p ∈ X. Each of the K closest neighbors
votes for x_q according to its label. Then, the above equation counts the votes and
returns the majority vote as the class of x_q. We explain the meaning of this
equation using Fig. 2.2. Assuming K = 1, KNN looks for the closest point to x_q in
the training set (shown by the black polygon in the figure). According to the figure,
the red circle is the closest point. Because K = 1, there is no further point to vote.
Consequently, the algorithm classifies x_q as red.
By setting K = 2 the algorithm searches for the two closest points, which in this case
are one red circle and one blue square. Then, the algorithm counts the votes for each
label. The votes are equal in this example. Hence, the method is not confident in
its decision. For this reason, in practice, we set K to an odd number so one of the
labels always has the majority of votes. If we set K = 3, there will be two votes for
the blue class and one vote for the red class. As a result, x_q will be classified as
blue.
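A minimal NumPy sketch of the voting rule in (2.2) and (2.3) is given below; the function and variable names, as well as the toy data, are our own.

```python
import numpy as np

def knn_classify(x_q, X, y, K=3):
    """Classify query x_q by majority vote among its K nearest training samples.

    X: N x d matrix of training vectors, y: length-N integer array of labels {0, 1}.
    """
    distances = np.linalg.norm(X - x_q, axis=1)       # Euclidean distance to every sample
    neighbors = np.argsort(distances)[:K]             # indices of the K closest samples
    votes = np.bincount(y[neighbors], minlength=2)    # count votes per label, as in (2.2)
    return int(np.argmax(votes))

# Toy usage: two clusters in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(knn_classify(np.array([3.5, 3.5]), X, y, K=3))   # expected: 1
```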
We classified every point on the plane using different values of K and X.
Figure 2.3 illustrates the result. The black solid line on the plots shows the border between the two regions with different class labels. This border is called the decision
boundary. When K = 1 there is always a region around the noisy points where
they are classified as the red class. However, by setting K = 3 those noisy regions
disappear and they become part of the correct class. As the value of K increases, the
decision boundary becomes smoother and small regions disappear.


Fig. 2.3 K-nearest neighbor applied on every point on the plane for different values of K

The original KNN does not take into account the distance to its neighbors when
it counts the votes. In some cases, we may want to weight the votes based on the
distance from the neighbors. This can be done by adding a weight term to (2.2):
$$F(x_q) = \arg\max_{v \in \{0,1\}} \sum_{p \in N_K(x_q)} w_i \, \delta(v, f(p)) \qquad (2.4)$$

$$w_i = \frac{1}{d(x_q, p)}. \qquad (2.5)$$

In the above equation, d(.) returns the distance between two vectors. According to this
formulation, the weight of each neighbor is equal to the inverse of its distance from
xq . Therefore, closer neighbors have higher weights. KNN can be easily extended
to datasets with more than two labels without any modifications. However, there


are two important issues with this method. First, finding the class of a query vector
requires separately computing the distance from all of the samples in the training set.
Unless we devise a solution such as partitioning the input space, this can be time and
memory consuming when we are dealing with large datasets. Second, it suffers from
a phenomenon called the curse of dimensionality. To put it simply, Euclidean distances
become very similar in high-dimensional spaces. As a result, if the input of KNN
is a high-dimensional vector then the distances to the closest and farthest
vectors might be very similar. For this reason, it might classify the query vectors
incorrectly.
To alleviate these problems, we try to find a discriminant function in order to
directly model the decision boundary. In other words, a discriminant function models
the decision boundary using the training samples in X. A discriminant function could
be a nonlinear function. However, one of the easiest ways to model decision boundaries
is with linear classifiers.

2.2 Linear Classifier
Assume a binary classification problem in which the label of the d-dimensional input
vector x ∈ R^d can be only 1 or −1. For example, detecting traffic signs in an image
can be formulated as a binary classification problem. To be more specific, given an
image patch, the aim of detection is to decide if the image represents a traffic sign or
a non-traffic sign. In this case, images of traffic signs and non-traffic signs might be
indicated using labels 1 and −1, respectively. Denoting the ith element of x with x_i,
x can be classified by computing the following linear relation:
$$f(x) = w_1 x_1 + \dots + w_i x_i + \dots + w_d x_d + b \qquad (2.6)$$

where w_i is a trainable parameter associated with x_i and b is another trainable parameter which is called the intercept or bias. The above equation represents a hyperplane
in a d-dimensional Euclidean space. The set of weights {w_i | i = 1, ..., d} determines the
orientation of the hyperplane and b indicates the distance of the hyperplane from the
origin. We can also write the above equation in terms of matrix multiplications:
$$f(x) = w x^{T} + b \qquad (2.7)$$

where w = [w_1, ..., w_d]. Likewise, it is possible to augment w with b and show all
parameters of the above equation in a single vector w_{w|b} = [b, w_1, ..., w_d]. With
this formulation, we can also augment x with 1 to obtain x_{x|1} = [1, x_1, ..., x_d] and
write the above equation using the following matrix multiplication:
$$f(x) = w_{w|b} \, x_{x|1}^{T}. \qquad (2.8)$$


Fig. 2.4 Geometry of linear models

From now on in this chapter, when we write w and x we are referring to w_{w|b} and x_{x|1},
respectively. Finally, x is classified by applying the sign function on f(x) as follows:
$$F(x) = \begin{cases} 1 & f(x) > 0 \\ NA & f(x) = 0 \\ -1 & f(x) < 0 \end{cases} \qquad (2.9)$$
In other words, x is classified as 1 if f(x) is positive and it is classified as −1 when
f(x) is negative. The special case happens when f(x) = 0, in which case x does not
belong to either of these two classes. Although it may never happen in practice to have
an x such that f(x) is exactly zero, it explains an important theoretical concept which
is called the decision boundary. We shall mention this topic shortly. Before that, we further
analyze w with respect to x. Clearly, f(x) is zero when x is exactly on the hyperplane.
Considering the fact that w and x are both d + 1 dimensional vectors, (2.8) denotes
the dot product of the two vectors. Moreover, we know from linear algebra that the
dot product of two orthogonal vectors is 0. Consequently, the vector w is orthogonal
to every point on the hyperplane.
This can be studied from another perspective, illustrated using a two-dimensional
example in Fig. 2.4. If we rewrite (2.6) in slope-intercept form, we will obtain:
$$x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}. \qquad (2.10)$$


where the slope of the line is equal to m = −w_1/w_2. In addition, a line is perpendicular
to the above line if its slope is equal to m' = −1/m = w_2/w_1. As a result, the weight
vector w = [w_1, w_2] is perpendicular to every point on the above line since its
slope is equal to w_2/w_1. Let us have a closer look at the geometry of the linear model.
The distance of a point x = [x_1, x_2] from the linear model can be found by projecting
x − x' (where x' is an arbitrary point on the line) onto w, which is given by:

$$r = \frac{|f(x)|}{\|w\|} \qquad (2.11)$$

Here, w refers to the weight vector before augmenting with b. Also, the signed
distance can be obtained by removing the abs (absolute value) operator from the
numerator:

$$r_{signed} = \frac{f(x)}{\|w\|}. \qquad (2.12)$$
When x is on the line (i.e., a hyperplane in an N-dimensional space) then f(x) = 0.
Hence, the distance from the decision boundary will be zero. The set of all points {x |
x ∈ R^d ∧ f(x) = 0} represents the boundary between the regions with labels −1 and
1. This boundary is called the decision boundary. However, if x is not on the decision
boundary its distance will be a nonzero value. Also, the sign of the distance depends
on the region that the point falls into. Intuitively, the model is more confident about
its classification when a point is far from the decision boundary. On the contrary, as it gets
closer to the decision boundary the confidence of the model decreases. This is the
reason that we sometimes call f(x) the classification score of x.
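The following minimal sketch computes the classification score, the predicted label from (2.9), and the signed distance from (2.12) for a toy two-dimensional example; the numbers are arbitrary placeholders.

```python
import numpy as np

def linear_score(w, b, x):
    """f(x) = w x^T + b for a single sample x."""
    return float(np.dot(w, x) + b)

def classify(w, b, x):
    """Apply the sign rule of (2.9); None plays the role of 'NA'."""
    s = linear_score(w, b, x)
    return 1 if s > 0 else (-1 if s < 0 else None)

def signed_distance(w, b, x):
    """Signed distance of x from the decision boundary, as in (2.12)."""
    return linear_score(w, b, x) / np.linalg.norm(w)

w, b = np.array([1.0, -2.0]), 0.5
x = np.array([3.0, 1.0])
print(linear_score(w, b, x), classify(w, b, x), signed_distance(w, b, x))
```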

2.2.1 Training a Linear Classifier
According to (2.9), the output of a linear classifier can be 1 or −1. This means that
the labels of the training data must also be members of the set {−1, 1}. Assume we are given
the training set X = {(x_0, y_0), ..., (x_n, y_n)} where x_i ∈ R^d is a d-dimensional vector and y_i ∈ {−1, 1} shows the label of the sample. In order to train a linear classifier,
we need to define an objective function. For any w_t, the objective function uses
X to tell how accurate f(x) = w_t x^T is at classifying the samples in X. The
objective function may also be called the error function or loss function. Without the
loss function, it is not trivial to assess the goodness of a model.
Our main goal in training a classification model is to minimize the number of
samples which are classified incorrectly. We can formulate this objective using the
following equation:
$$\mathcal{L}_{0/1}(w) = \sum_{i=1}^{n} H_{0/1}(w x_i^{T}, y_i) \qquad (2.13)$$

$$H_{0/1}(w x_i^{T}, y_i) = \begin{cases} 1 & w x_i^{T} \times y_i < 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2.14)$$


Fig. 2.5 The intuition behind the squared loss function is to minimize the squared difference between
the actual response and the predicted value. Left and right plots show two lines with different w1 and
b. The line in the right plot is fitted better than the line in the left plot since its prediction error is
lower in total

The above loss function is called the 0/1 loss function. A sample is classified correctly
when the signs of wx_i^T and y_i are identical. If x_i is not correctly classified by the
model, the signs of these two terms will not be identical. This means that one of
these two terms will be negative and the other one will be positive. Therefore, their
multiplication will be negative. We see that H_{0/1}(.) returns 1 when the sample is
classified incorrectly. Based on this explanation, the above loss function counts the
number of misclassified samples. If all samples in X are classified correctly, the
above loss function will be zero. Otherwise, it will be greater than zero. There are
two problems with the above loss function which make it impractical. First, the
0/1 loss function is nonconvex. Second, it is hard to optimize this function using
gradient-based optimization methods since the function is not continuous at 0 and
its gradient is zero elsewhere.
Instead of counting the number of misclassified samples, we can formulate the
classification problem as a regression problem and use the squared loss function.
This can be better described using a one-dimensional input vector x ∈ R in Fig. 2.5.
In this figure, circles and squares illustrate the samples with labels −1 and 1,
respectively. Since x is one-dimensional (a scalar), the linear model will be f(x) =
w_1 x_1 + b with only two trainable parameters. This model can be plotted using a line
in a two-dimensional space. Assume the line shown in this figure. Given any x, the
output of the function is a real number. In the case of circles, the model should ideally
return −1. Similarly, it should return 1 for all squares in this figure. Notwithstanding,
because f(x) is a linear model, f(x_1) ≠ f(x_2) if x_1 ≠ x_2. This means it is impossible
for our model to return 1 for every square in this figure. Instead, it will return a
unique value for each point in this figure.
For this reason, there is an error between the actual output of a point (circle
or square) and the value predicted by the model. These errors are illustrated using


red solid lines in this figure. The estimation error for xi can be formulated as ei =
( f (xi ) − yi ) where yi ∈ {−1, 1} is the actual output of xi as we defined previously
in this section. Using this formulation, we can define the squared loss function as
follows:
$$\mathcal{L}_{sq}(w) = \sum_{i=1}^{n} \sqrt{(e_i)^2} = \sum_{i=1}^{n} \sqrt{(w x_i^{T} - y_i)^2}. \qquad (2.15)$$

In this equation, x_i ∈ R^d is a d-dimensional vector and y_i ∈ {−1, 1} is its actual label.
This loss function treats the labels as real numbers rather than categorical values. This
makes it possible to estimate the prediction error by subtracting predicted values
from actual values. Note from Fig. 2.5 that e_i can be a negative or a positive value.
In order to compute the magnitude of e_i, we first compute the square of e_i and then apply
the square root in order to obtain the absolute value of e_i. It should be noted that we
could define the loss function as Σ_{i=1}^{n} |w x_i^T − y_i| instead of Σ_{i=1}^{n} √((w x_i^T − y_i)²).
However, as we will see shortly, the second formulation has a desirable property
when we utilize a gradient-based optimization method to minimize the above loss function.
We can further simplify (2.15). If we unroll the sum operator in (2.15), it will look
like:



$$\mathcal{L}_{sq}(w) = \sqrt{(w x_1^{T} - y_1)^2} + \dots + \sqrt{(w x_n^{T} - y_n)^2}. \qquad (2.16)$$

Taking into account the fact that the square root is a monotonically increasing function
and it is applied on each term individually, eliminating this operator from the above
equation does not change the location of the minimum of L(w). By applying this to the above
equation, we will obtain:
$$\mathcal{L}_{sq}(w) = \sum_{i=1}^{n} (w x_i^{T} - y_i)^2. \qquad (2.17)$$

Our objective is to minimize the prediction error. In other words:
$$w = \min_{w' \in \mathbb{R}^{d+1}} \mathcal{L}(w') \qquad (2.18)$$

This is achievable by minimizing L_sq with respect to w ∈ R^{d+1}. In order to minimize the above loss function, we can use an iterative gradient-based optimization
method such as gradient descent (Appendix A). Starting from an initial vector w_sol ∈ R^{d+1}, this method iteratively changes w_sol proportionally to the gradient
vector ∇L = [δL/δw_0, δL/δw_1, ..., δL/δw_d]. Here, we have shown the intercept using w_0
instead of b. Consequently, we need to calculate the partial derivative of the loss
function with respect to each of the parameters in w as follows:

$$\frac{\delta \mathcal{L}}{\delta w_j} = 2 \sum_{i=1}^{n} x_{i,j}\,(w x_i^{T} - y_i) \quad \forall j = 1 \dots d, \qquad \frac{\delta \mathcal{L}}{\delta w_0} = 2 \sum_{i=1}^{n} (w x_i^{T} - y_i), \qquad (2.19)$$

where x_{i,j} denotes the jth element of the training sample x_i.


One problem with the above equation is that L_sq might be a large value if there
are many training samples in X. For this reason, we might need to use a very small
learning rate in the gradient descent method. To alleviate this problem, we can
compute the mean squared error by dividing L_sq by the total number of training
samples. In addition, we can eliminate the factor 2 in the partial derivatives by multiplying L_sq
by 1/2. The final squared loss function can be defined as follows:
$$\mathcal{L}_{sq}(w) = \frac{1}{2n} \sum_{i=1}^{n} (w x_i^{T} - y_i)^2 \qquad (2.20)$$

with its partial derivatives equal to:

$$\frac{\delta \mathcal{L}}{\delta w_j} = \frac{1}{n} \sum_{i=1}^{n} x_{i,j}\,(w x_i^{T} - y_i) \quad \forall j = 1 \dots d, \qquad \frac{\delta \mathcal{L}}{\delta w_0} = \frac{1}{n} \sum_{i=1}^{n} (w x_i^{T} - y_i) \qquad (2.21)$$

Note that the location of the minimum of (2.17) is identical to that of (2.20); the latter
function is just multiplied by a constant value. However, adjusting the learning rate is
easier when we use (2.20) to find the optimal w. One important property of the squared
loss function with linear models is that it is a convex function. This means the
gradient descent method will always converge at the global minimum regardless of
the initial point. It is worth mentioning that this property does not hold if the classification
model is a nonlinear function of its parameters. We minimized the squared loss function
on the dataset shown in Fig. 2.1. Figure 2.6 shows the status of gradient descent
at four different iterations.
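A minimal NumPy sketch of this training procedure, using the loss (2.20) and the partial derivatives (2.21), is shown below. The learning rate, the number of iterations, and the toy data are arbitrary choices made for illustration.

```python
import numpy as np

def train_squared_loss(X, y, lr=0.1, iterations=2000):
    """Minimize (2.20) with gradient descent.

    X: N x d training matrix, y: labels in {-1, +1}. The input is augmented
    with a constant 1 so that w[0] plays the role of the intercept.
    """
    N, d = X.shape
    Xa = np.hstack([np.ones((N, 1)), X])     # x_{x|1} = [1, x_1, ..., x_d]
    w = np.zeros(d + 1)                      # w_{w|b} = [b, w_1, ..., w_d]
    for _ in range(iterations):
        errors = Xa @ w - y                               # (w x_i^T - y_i) for every sample
        grad = (Xa * errors[:, None]).mean(axis=0)        # partial derivatives of (2.21)
        w -= lr * grad
    return w

# Toy usage on two linearly separable clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
w = train_squared_loss(X, y)
predictions = np.sign(np.hstack([np.ones((100, 1)), X]) @ w)
print('training accuracy:', (predictions == y).mean())
```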
The background of the plots shows the label of each region according to the sign
of the classification score computed for each point on the plane. The initial model is
very inaccurate since most of the vectors are classified as red. However, it becomes
more accurate after 400 iterations. Finally, it converges at iteration 2000. As you
can see, the amount of change in the first iterations is higher than in the last iterations.
By looking at the partial derivatives, we realize that the change of a parameter is
directly related to the prediction error. Because the prediction error is high in the
first iterations, the parameters of the model change considerably. As the error reduces,
the parameters change only slightly. The intuition behind the least squares loss function
can be studied from another perspective.
Assume the two hypothetical lines parallel to the linear model shown in Fig. 2.7.
The actual distance of these lines from the linear model is equal to 1. In the case
of the negative region, the signed distance of the hypothetical line is −1. On the other
hand, we know from our previous discussion that the normalized distance of a sample
x from the decision boundary is equal to f(x)/‖w‖ where, here, w refers to the parameter
vector before augmenting. If we consider the projection of x onto w and utilize the fact that
wx = ‖w‖‖x‖ cos(θ), we will see that the unnormalized distance of sample x from


Fig. 2.6 Status of gradient descent at four different iterations. The parameter vector w changes
greatly in the first iterations. However, as it gets closer to the minimum of the squared loss function,
it changes only slightly

the linear model is equal to f(x). Based on that, the least squares loss tries to minimize
the sum of unnormalized distances of the samples from their actual hypothetical line.
One problem with the least squares loss function is that it is sensitive to outliers. This is
illustrated using an example in Fig. 2.8. In general, noisy samples do not come from
the same distribution as clean samples. This means that they might not be close to
clean samples in the d-dimensional space. On the one hand, the squared loss function tries
to minimize the prediction error of the samples. On the other hand, because the
noisy samples are located far from the clean samples, they have a large prediction
error. For this reason, some of the clean samples might be sacrificed in order to
reduce the error on the noisy sample. We can see in this figure that, because of the
noisy sample, the model is not able to fit the data accurately.


Fig. 2.7 Geometrical intuition behind least square loss function is to minimize the sum of unnormalized distances between the training samples xi and their corresponding hypothetical line

Fig. 2.8 Square loss function may fit inaccurately on training data if there are noisy samples in the
dataset

It is also likely in practice that clean samples form two or more separate clusters
in the d-dimensional space. Similar to the scenario of noisy samples, the squared loss
tries to minimize the prediction error of the samples in the far cluster as well. As we
can see in the figure, the linear model might not be accurately fitted to the data if
clean samples form two or more separate clusters.


This problem is due to the fact that the squared loss does not take into account
the label of the prediction. Instead, it considers the classification score and computes
the prediction error. For example, assume the training pairs:
$$\{(x_a, 1), (x_b, 1), (x_c, -1), (x_d, -1)\} \qquad (2.22)$$

Also, suppose two different configurations w1 and w2 for the parameters of the linear
model with the following responses on the training set:
$$\begin{aligned}
f_{w_1}(x_a) &= 10 & f_{w_2}(x_a) &= 5 \\
f_{w_1}(x_b) &= 1 & f_{w_2}(x_b) &= 2 \\
f_{w_1}(x_c) &= -0.5 & f_{w_2}(x_c) &= 0.2 \\
f_{w_1}(x_d) &= -1.1 & f_{w_2}(x_d) &= -0.5 \\
\mathcal{L}_{sq}(w_1) &= 10.15 & \mathcal{L}_{sq}(w_2) &= 2.33
\end{aligned} \qquad (2.23)$$

In terms of the squared loss, w_2 is better than w_1. But, if we count the number of
misclassified samples, we see that w_1 is the better configuration. In classification
problems, we are mainly interested in reducing the number of incorrectly classified
samples. As a result, w_1 is preferable to w_2 in this setting. In order to alleviate
this problem of the squared loss function, we can define the following loss function to
estimate the 0/1 loss:
$$\mathcal{L}_{sg}(w) = \sum_{i=1}^{n} 1 - \mathrm{sign}(f(x_i))\,y_i. \qquad (2.24)$$

If f(x_i) predicts correctly, its sign will be identical to the sign of y_i, in which case their
multiplication will be equal to +1. Thus, the outcome of 1 − sign(f(x_i))y_i will
be zero. On the contrary, if f(x_i) predicts incorrectly, its sign will be different from y_i.
So, their multiplication will be equal to −1. That being the case, the result of 1 −
sign(f(x_i))y_i will be equal to 2. For this reason, L_sg returns twice the number
of misclassified samples.
The above loss function looks intuitive and it is not sensitive to far samples. However, finding the minimum of this loss function using gradient-based optimization
methods is hard. The reason is the sign function. One solution to this
problem is to approximate the sign function using a differentiable function. Fortunately, tanh (hyperbolic tangent) is able to accurately approximate the sign function.
More specifically, tanh(kx) ≈ sign(x) when k ≫ 1. This is illustrated in Fig. 2.9.
As k increases, the tanh function approximates the sign function more
accurately.
By replacing the sign function with tanh in (2.24), we will obtain:
$$\mathcal{L}_{sg}(w) = \sum_{i=1}^{n} 1 - \tanh(k f(x_i))\,y_i. \qquad (2.25)$$


Fig. 2.9 The sign function can be accurately approximated using tanh(kx) when k ≫ 1

Similar to the squared loss function, the sign loss function can be minimized using
the gradient descent method. To this end, we need to compute the partial derivatives
of the sign loss function with respect to its parameters:
$$\frac{\delta \mathcal{L}_{sg}(w)}{\delta w_i} = -k x_i y (1 - \tanh^2(k f(x))), \qquad \frac{\delta \mathcal{L}_{sg}(w)}{\delta w_0} = -k y (1 - \tanh^2(k f(x))) \qquad (2.26)$$

If we train a linear model using the sign loss function and the gradient descent
method on the datasets shown in Figs. 2.1 and 2.8, we will obtain the results illustrated
in Fig. 2.10. According to the results, the sign loss function is able to deal with
separated clusters of samples and outliers, as opposed to the squared loss function.
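For completeness, a minimal sketch of gradient descent on the tanh-based sign loss (2.25), using the derivatives in (2.26), could look as follows. The gradient is averaged over the samples to keep the step size stable, and k, the learning rate, and the iteration count are arbitrary choices.

```python
import numpy as np

def train_sign_loss(X, y, k=2.0, lr=0.1, iterations=2000):
    """Minimize the tanh approximation of the 0/1 loss, as in (2.25)."""
    N, d = X.shape
    Xa = np.hstack([np.ones((N, 1)), X])          # augmented inputs; w[0] is the intercept
    w = np.zeros(d + 1)
    for _ in range(iterations):
        f = Xa @ w                                # classification scores f(x_i)
        # Per-sample gradient of 1 - tanh(k f(x_i)) y_i, as in (2.26), averaged over i.
        common = -k * y * (1.0 - np.tanh(k * f) ** 2)
        grad = (Xa * common[:, None]).mean(axis=0)
        w -= lr * grad
    return w
```

It can be called exactly like the squared-loss trainer above, on the same toy data.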
Even though the sign loss using the tanh approximation does a fairly good job on
our sample dataset, it has one issue which makes the optimization slow. In order to
explain this issue, we should study the derivative of the tanh function. We know from
calculus that δtanh(x)/δx = 1 − tanh²(x). Figure 2.11 shows its plot. We can see that the
derivative of tanh saturates as |x| increases. Also, it saturates more rapidly if we set k
to a positive number greater than 1. On the other hand, we know from (2.26) that the
gradient of the sign loss function directly depends on the derivative of the tanh function.
That means that if the derivative of a sample falls into the saturated region, its magnitude
is close to zero. As a consequence, the parameters change very slightly. This phenomenon,
which is called the saturated gradients problem, slows down the convergence speed
of the gradient descent method. As we shall see in the next chapters, in complex
models such as neural networks with millions of parameters, the model may not


Fig. 2.10 The sign loss function is able to deal with noisy datasets and separated clusters problem
mentioned previously
Fig. 2.11 The derivative of the tanh(kx) function saturates as |x| increases. Also, the rate of
saturation grows rapidly when k > 1

be able to adjust the parameters of the initial layers since the saturated gradients are
propagated from the last layers back to the first layers.

2.2.2 Hinge Loss
Earlier in this chapter, we explained that the normalized distance of a sample x from the
decision boundary is equal to |f(x)|/‖w‖. Likewise, the margin of x is obtained by computing
(wx^T)y where y is the corresponding label of x. The margin tells us how correct the
classification of the sample is. Assume that the label of x_a is −1. If wx_a^T is negative,
its multiplication with y = −1 will be positive, showing that the sample is classified
correctly with a confidence analogous to |wx^T|. Likewise, if wx_a^T is positive, its


Fig. 2.12 Hinge loss increases the margin of samples while it is trying to reduce the classification
error. Refer to text for more details

multiplication with y = −1 will be negative, showing that the sample is classified
incorrectly with a magnitude equal to |wx^T|.
The basic idea behind the hinge loss is not only to train a classifier but also to increase
the margin of the samples. This is an important property which may increase the tolerance of
the classifier against noisy samples. This is illustrated in Fig. 2.12 on a synthetic
dataset which is perfectly separable using a line. The solid line shows the decision
boundary and the dashed lines illustrate the borders of the critical region centered
at the decision boundary of this model. This means that the margin of samples in this
region is less than |a|. In contrast, the margin of samples outside this region is high,
which implies that the model is more confident in the classification of samples outside
this region. Also, the colorbar next to each plot depicts the margin corresponding
to each color on the plot.


In the first plot, two test samples are indicated which are not used during the
training phase. One of them belongs to the circles and the other belongs to the squares.
Although the line adjusted on the training samples is able to perfectly discriminate
the training samples, it will incorrectly classify the red test sample. Comparing the
model in the second plot with the first plot, we observe that fewer circles are inside
the critical region but the number of squares inside this region increases. In the third
plot, the overall margin of the samples is better if we compare the samples marked with
white ellipses on these plots. Finally, the best overall margin is found in the fourth
plot, where the test samples are also correctly classified.
Maximizing the margin is important since it may increase the tolerance of model
against noise. The test samples in Fig. 2.12 might be noisy samples. However, if the
margin of the model is large, it is likely that these samples are classified correctly.
Nonetheless, it is still possible to design a test scenario where the first plot
could be more accurate than the fourth plot. But, as the number of training samples
increases, a classifier with a maximum margin is likely to be more stable. Now, the
question is how we can force the model, through a loss function, to increase its accuracy
and margin simultaneously. The hinge loss function achieves these goals using the
following relation:
L_{hinge}(w) = \frac{1}{n} \sum_{i=1}^{n} \max(0, a - w x_i^T y_i)   (2.27)

where yi ∈ {−1, 1} is the label of the training sample xi. If the signs of w x_i^T and yi
are equal, the term inside the sum operator will return 0 since the value of the
second parameter in the max function will be negative. In contrast, if their signs
are different, this term will be equal to a − w x_i^T yi, increasing the value of the loss.
Moreover, if w x_i^T yi < a, this implies that xi is within the critical region of the model
and it increases the value of the loss. By minimizing the above loss function, we will
obtain a model with a maximum margin and high accuracy at the same time. The term
inside the sum operator can be written as:
\max(0, a - w x_i^T y_i) = \begin{cases} a - w x_i^T y_i & w x_i^T y_i < a \\ 0 & w x_i^T y_i \geq a \end{cases}   (2.28)

Using this formulation and denoting max(0, a − w x_i^T y_i) with H, we can compute
the partial derivatives of L_{hinge}(w) with respect to w:

\frac{\delta H}{\delta w_i} = \begin{cases} -x_i y_i & w x_i^T y_i < a \\ 0 & w x_i^T y_i \geq a \end{cases} \qquad
\frac{\delta H}{\delta w_0} = \begin{cases} -y_i & w x_i^T y_i < a \\ 0 & w x_i^T y_i \geq a \end{cases}   (2.29)


Fig. 2.13 Training a linear classifier using the hinge loss function on two different datasets

\frac{\delta L_{hinge}(w)}{\delta w_i} = \frac{1}{n} \sum_{i=1}^{n} \frac{\delta H}{\delta w_i} \qquad
\frac{\delta L_{hinge}(w)}{\delta w_0} = \frac{1}{n} \sum_{i=1}^{n} \frac{\delta H}{\delta w_0}   (2.30)

It should be noted that the derivative of L_{hinge}(w) is not continuous at w x_i^T y_i = a and, consequently, the loss is not differentiable at that point. For this reason, a better choice for
optimizing the above function might be a subgradient-based method. However, it
rarely happens in a training set that w x_i^T y_i is exactly equal
to a. For this reason, we can still use the gradient descent method for optimizing
this function.
Furthermore, the solution does not depend on the particular value of a; it only affects
the magnitude of w. In other words, w is always adjusted such that as few training
samples as possible fall into the critical region. For this reason, we always set a = 1
in practice. We minimized the hinge loss on the datasets shown in Figs. 2.1 and 2.8.
Figure 2.13 illustrates the result. As before, the region between the two dashed lines
indicates the critical region.
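As a small illustration (not the book's implementation), the following NumPy sketch computes the hinge loss (2.27) with a = 1 and the (sub)gradient from (2.29)–(2.30), folding the bias into the input as a constant column, and runs plain gradient descent on a toy two-class problem.

import numpy as np

def hinge_loss_and_grad(w, X, y, a=1.0):
    """Hinge loss (2.27) and its (sub)gradient for a linear model.

    X is an (n, d+1) matrix whose last column is 1 so that w[-1] acts as the bias;
    y contains labels in {-1, +1}. A minimal sketch, not the book's code.
    """
    margins = y * (X @ w)                    # w x_i^T y_i for every sample
    active = margins < a                     # inside the critical region or misclassified
    loss = np.mean(np.maximum(0.0, a - margins))
    grad = -(X[active] * y[active][:, None]).sum(axis=0) / len(y)
    return loss, grad

# toy usage: two Gaussian blobs with an appended bias column
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
X = np.hstack([X, np.ones((100, 1))])
y = np.hstack([-np.ones(50), np.ones(50)])
w = np.zeros(3)
for _ in range(200):                         # plain gradient descent
    loss, grad = hinge_loss_and_grad(w, X, y)
    w -= 0.1 * grad
print(loss, w)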
Based on the results, the model learned by the hinge loss function is able to deal
with the separated clusters problem. Also, it is able to learn an accurate model for the
nonlinearly separable dataset. A variant of the hinge loss called the squared hinge loss has
also been proposed, which is defined as follows:
L_{hinge}(w) = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - w x_i^T y_i)^2   (2.31)


The main difference between the hinge loss and the squared hinge loss is that the
latter one is smoother and it may make the optimization easier. Another variant of
the hinge loss function is called modified Huber and it is defined as follows:

L_{huber}(w) = \begin{cases} \max(0, 1 - y w x^T)^2 & y w x^T \geq -1 \\ -4 y w x^T & \text{otherwise} \end{cases}   (2.32)

The modified Huber loss is very close to the squared hinge and they may only differ
in the convergence speed. In order to use any of these variants to train a model,
we need to compute the partial derivative of the loss functions with respect to their
parameters.
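For a quick feel for how the variants penalize a sample, the short NumPy sketch below (illustrative only) evaluates the hinge, squared hinge, and modified Huber losses as functions of the margin m = y·wx^T: all three vanish for well-classified samples and grow as the margin becomes negative, with the modified Huber switching to a linear penalty below m = −1.

import numpy as np

def hinge(m):            # max(0, 1 - m), Eq. (2.27) with a = 1
    return np.maximum(0.0, 1.0 - m)

def squared_hinge(m):    # max(0, 1 - m)^2, Eq. (2.31)
    return np.maximum(0.0, 1.0 - m) ** 2

def modified_huber(m):   # Eq. (2.32): quadratic for m >= -1, linear otherwise
    return np.where(m >= -1.0, np.maximum(0.0, 1.0 - m) ** 2, -4.0 * m)

margins = np.linspace(-3, 3, 7)
print(np.column_stack([margins, hinge(margins),
                       squared_hinge(margins), modified_huber(margins)]))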

2.2.3 Logistic Regression
None of the previously mentioned linear models is able to compute the probability of
a sample x belonging to the class y = 1. Formally, given a binary classification problem,
we might be interested in computing p(y = 1|x). This implies that p(y = −1|x) =
1 − p(y = 1|x). Consequently, the sample x belongs to class 1 if p(y = 1|x) > 0.5.
Otherwise, it belongs to class −1. In the case that p(y = 1|x) = 0.5, the sample is
exactly on the decision boundary and it does not belong to either of these two classes.
The basic idea behind logistic regression is to learn p(y = 1|x) using a linear model.
To this end, logistic regression transforms the score of a sample into probability by
passing the score through a sigmoid function. Formally, logistic regression computes
the posterior probability as follows:
p(y = 1|x; w) = \sigma(w x^T) = \frac{1}{1 + e^{-w x^T}}.   (2.33)

In this equation, σ : R → [0, 1] is the logistic sigmoid function. As shown in
Fig. 2.14, the function has an S shape and it saturates as |x| increases. In other words,
the derivative of the function approaches zero as |x| increases.
Since the range of the sigmoid function is [0, 1], it satisfies the requirements of a probability function. Note that (2.33) directly models the posterior probability,
which means that, by using appropriate techniques that we shall explain later, it is able
to model the likelihood and the prior of the classes. Taking into account the fact that (2.33)
returns the probability of a sample, the loss function must also be built based on
the probability of the whole training set given a specific w. Formally, given a dataset of
n training samples, our goal is to maximize their joint probability, which is defined
as:
L_{logistic}(w) = p(x_1 \cap x_2 \cap \cdots \cap x_n) = p\left(\bigcap_{i=1}^{n} x_i\right).   (2.34)

Fig. 2.14 Plot of the sigmoid function (left) and logarithm of the sigmoid function (right). The domain of the sigmoid function is real numbers and its range is [0, 1]

Modeling the above joint probability is not trivial. However, it is possible to decompose this probability into smaller components. To be more specific, the probability
of x_i does not depend on the probability of x_j. For this reason, and taking into
account the fact that p(A, B) = p(A)p(B) if A and B are independent events, we
can decompose the above joint probability into a product of probabilities:
L_{logistic}(w) = \prod_{i=1}^{n} p(y_i | x_i)   (2.35)

where p(y_i | x_i) is computed using:

p(y_i | x_i) = \begin{cases} p(y = 1|x; w) & y_i = 1 \\ 1 - p(y = 1|x; w) & y_i = -1 \end{cases}   (2.36)

Representing the negative class with 0 rather than −1, the above equation can be
written as:
p(y_i | x_i) = p(y = 1|x; w)^{y_i} (1 - p(y = 1|x; w))^{1-y_i}.   (2.37)

This equation, which is called the Bernoulli distribution, is used to model random variables
with two outcomes. Plugging (2.33) into the above equation, we will obtain:


L_{logistic}(w) = \prod_{i=1}^{n} \sigma(w x^T)^{y_i} (1 - \sigma(w x^T))^{1-y_i}.   (2.38)


Optimizing the above function is hard. The reason is the product operator, which
makes the derivative of the loss function intractable. However, we can apply the logarithm
trick to change the multiplication into a summation. In other words, we can compute
log(L_{logistic}(w)):
\log(L_{logistic}(w)) = \log \prod_{i=1}^{n} \sigma(w x^T)^{y_i} (1 - \sigma(w x^T))^{1-y_i}.   (2.39)


We know from the properties of the logarithm that log(A × B) = log(A) + log(B). As a
result, the above equation can be written as:
\log(L_{logistic}(w)) = \sum_{i=1}^{n} y_i \log \sigma(w x^T) + (1 - y_i) \log(1 - \sigma(w x^T)).   (2.40)

If each sample in the training set is classified correctly, p(y_i|x_i) will be close to 1, and
if it is classified incorrectly, it will be close to zero. Therefore, the best classification
will be obtained if we find the maximum of the above function. Although this can
be done using gradient ascent methods, it is preferable to use gradient descent
methods. Because gradient descent can only be applied to minimization problems,
we can multiply both sides of the equation by −1 in order to change the maximization
problem into a minimization problem:
E = -\log(L_{logistic}(w)) = -\sum_{i=1}^{n} y_i \log \sigma(w x^T) + (1 - y_i) \log(1 - \sigma(w x^T)).   (2.41)
Now, we can use gradient descent to find the minimum of the above loss function.
This function is called the cross-entropy loss. In general, loss functions of this kind
are called negative log-likelihood functions. As before, we must compute the partial
derivatives of the loss function with respect to its parameters in order to apply the
gradient descent method. To this end, we need to compute the derivative of σ(a)
with respect to its parameter, which is equal to:
\frac{\delta \sigma(a)}{\delta a} = \sigma(a)(1 - \sigma(a)).   (2.42)

Then, we can utilize the chain rule to compute the partial derivative of the above loss
function. Doing so, we will obtain:


\frac{\delta E}{\delta w_i} = \left(\sigma(w x_i^T) - y_i\right) x_i \qquad
\frac{\delta E}{\delta w_0} = \sigma(w x_i^T) - y_i   (2.43)

Note that in contrast to the previous loss functions, here, yi ∈ {0, 1}. In other words,
the negative class is represented using 0 instead of −1. Figure 2.15 shows the result of
training linear models on the two previously mentioned datasets. We see that logistic
regression is able to find an accurate model even when the training samples are scattered in
more than two clusters. Also, in contrast to the squared loss, it is less sensitive
to outliers.
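Putting (2.42) and (2.43) together gives a very short training loop. The NumPy sketch below (illustrative only, with the bias folded into the input as a constant column) fits a logistic regression model by gradient descent on a toy dataset with labels in {0, 1}.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_grad(w, X, y):
    """Gradient of the cross-entropy loss (2.41) for a linear model.

    X is (n, d+1) with a trailing column of ones for the bias; y holds labels
    in {0, 1}. Following (2.43), the gradient is (sigma(w x^T) - y) x averaged
    over the samples. A minimal sketch, not the book's implementation.
    """
    p = sigmoid(X @ w)
    return X.T @ (p - y) / len(y)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
X = np.hstack([X, np.ones((100, 1))])
y = np.hstack([np.zeros(50), np.ones(50)])
w = np.zeros(3)
for _ in range(500):
    w -= 0.5 * cross_entropy_grad(w, X, y)
print("p(y=1|x) for one sample of each class:", sigmoid(X[[0, -1]] @ w))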
It is possible to formulate the logistic loss with yi ∈ {−1, 1}. In other words, we
can represent the negative class using −1 and reformulate the logistic loss function.


Fig. 2.15 Logistic regression is able to deal with separated clusters

More specifically, we can rewrite the logistic equations as follows:
p(y = 1|x) = \frac{1}{1 + e^{-w x^T}} \qquad p(y = -1|x) = 1 - p(y = 1|x) = \frac{1}{1 + e^{+w x^T}}   (2.44)

This implies that:
p(y_i | x_i) = \frac{1}{1 + e^{-y_i w x^T}}   (2.45)

Plugging this in (2.35) and taking the negative logarithm, we will obtain:

L_{logistic}(w) = \sum_{i=1}^{n} \log(1 + e^{-y_i w x^T})   (2.46)

It should be noted that (2.41) and (2.46) are identical and they lead to the same
solution. Consequently, we can use either of them to fit a linear model. As before, we
only need to compute the partial derivatives of the loss function and use them in the
gradient descent method to minimize the loss function.
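As a quick sanity check of this equivalence, the following snippet (an illustrative sketch, not from the book) evaluates (2.41) with labels in {0, 1} and (2.46) with the same labels mapped to {−1, +1} for a random linear model; the two losses coincide up to floating-point error.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
w = rng.normal(size=3)
X = np.hstack([rng.normal(size=(10, 2)), np.ones((10, 1))])
y01 = rng.integers(0, 2, 10)          # labels in {0, 1} for Eq. (2.41)
ypm = 2 * y01 - 1                     # the same labels mapped to {-1, +1} for Eq. (2.46)

scores = X @ w
loss_241 = -np.sum(y01 * np.log(sigmoid(scores)) +
                   (1 - y01) * np.log(1 - sigmoid(scores)))
loss_246 = np.sum(np.log(1 + np.exp(-ypm * scores)))
print(loss_241, loss_246)             # the two values agree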

2.2.4 Comparing Loss Functions
We explained seven different loss functions for training a linear model. We also discussed
some of their properties in the presence of outliers and separated clusters. In this section,
we compare these loss functions from different perspectives. Table 2.1 compares the
different loss functions. Besides, Fig. 2.16 illustrates the plot of the loss functions
along with their second derivative.


Table 2.1 Comparing different loss functions

Loss function       | Equation                                                                                   | Convex
Zero-one loss       | L_{0/1}(w) = \sum_{i=1}^{n} H_{0/1}(w x^T, y_i)                                            | No
Squared loss        | L_{sq}(w) = \sum_{i=1}^{n} (w x_i^T - y_i)^2                                               | Yes
Tanh squared loss   | \sum_{i=1}^{n} 1 - \tanh(k f(x_i)) y_i                                                     | No
Hinge loss          | L_{hinge}(w) = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - w x_i^T y_i)                         | Yes
Squared hinge loss  | L_{hinge}(w) = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - w x_i^T y_i)^2                       | Yes
Modified Huber      | L_{huber}(w) = \max(0, 1 - y w x^T)^2 if y w x^T \geq -1, and -4 y w x^T otherwise         | Yes
Logistic loss       | -\log(L_{logistic}(w)) = -\sum_{i=1}^{n} y_i \log \sigma(w x^T) + (1 - y_i) \log(1 - \sigma(w x^T)) | Yes

Fig. 2.16 The tanh squared loss and zero-one loss functions are not convex. In contrast, the squared
loss, the hinge loss and its variants, and the logistic loss functions are convex

Informally, a one-variable function is convex if, for every pair of points x and
y, the function falls below the line connecting them. Formally, if the second derivative
of a function is nonnegative everywhere, the function is convex. Looking at the plots of each loss
function and their derivatives, we realize that the tanh squared loss and the zero-one
loss functions are not convex. In contrast, the hinge loss and its variants as well as the
logistic loss are all convex functions. Convexity is an important property since it
guarantees that the gradient descent method will find the global minimum of the
function provided that the classification model is linear.
Let us have a closer look at the logistic loss function on the dataset which is
linearly separable. Assume the parameter vector ŵ is such that the two classes are separated perfectly. This is shown by the top-left plot in Fig. 2.17. However, because the
magnitude of ŵ is low, σ(ŵx^T) is smaller than 1 for the points close to the decision
boundary. In order to increase the value of σ(ŵx^T) without affecting the classification accuracy, the optimization method may increase the magnitude of ŵ. As we can
see in the other plots, as the magnitude increases, the logistic loss reduces. The magnitude
of ŵ can increase indefinitely, resulting in the logistic loss approaching zero.


Fig. 2.17 Logistic regression tries to reduce the logistic loss even after finding a hyperplane which
discriminates the classes perfectly

However, as we will explain in the next chapter, parameter vectors with a high
magnitude may suffer from a problem called overfitting. For this reason, we are
usually interested in finding parameter vectors with low magnitudes. Looking at the
plot of the logistic loss in Fig. 2.16, we see that the function approaches zero
at infinity. This is the reason that the magnitude of the model keeps increasing.
We can analyze the hinge loss function from the same perspective. Looking at
the plot of the hinge loss function, we see that it becomes zero as soon as it finds a
hyperplane in which all the samples are classified correctly and they are outside the
critical region. We fitted a linear model using the hinge loss function on the same
dataset as the previous paragraph. Figure 2.18 shows that after finding a hyperplane
that classifies the samples perfectly, the magnitude of w increases until all the samples
are outside the critical region. At this point, the error becomes zero and w does not
change anymore. In other words, w has an upper bound when we find it using the
hinge loss function.


Fig. 2.18 Using the hinge loss function, the magnitude of w changes until all the samples are
classified correctly and they do not fall into the critical region

The above argument about logistic regression does not hold when the classes
are not linearly separable. In other words, in the case that the classes are nonlinearly
separable, it is not possible to perfectly classify all the training samples. Consequently, some of the training samples are always classified incorrectly. In this case,
as shown in Fig. 2.19, if w increases, the error of the misclassified samples
also increases, resulting in a higher loss. For this reason, the optimization algorithm
changes the value of w only up to a certain point. In other words, there could be an upper
bound for w when the classes are not linearly separable.


Fig. 2.19 When classes are not linearly separable, w may have an upper bound in logistic loss
function

2.3 Multiclass Classification
In the previous section, we mentioned a few techniques for training a linear classifier
on binary classification problems. Recall from the previous section that in a binary
classification problem our goal is to classify the input x ∈ Rd into one of two classes.
A multiclass classification problem is a more generalized concept in which x is
classified into more than two classes. For example, suppose we want to classify 10
different speed limit signs starting from 30 to 120 km/h. In this case, x represents
the image of a speed limit sign. Then, our goal is to find the model f : Rd → Y
where Y = {0, 1, . . . , 9}. The model f (x) accepts a d-dimensional real vector and
returns a categorical integer between 0 and 9. It is worth mentioning that Y is not
an ordered set. It can be any set with 10 different symbols. However, for the sake of
simplicity, we usually use integer numbers to show classes.

2.3.1 One Versus One
A multiclass classifier can be built using a group of binary classifiers. For instance,
assume the 4-class classification problem illustrated in Fig. 2.20 where
Y = {0, 1, 2, 3}. One technique for building a multiclass classifier using a group
of binary classifiers is called one-versus-one (OVO).
Given the dataset X = {(x0 , y0 ), . . . , (xn , yn )} where xi ∈ Rd and yi ∈
{0, 1, 2, 3}, we first pick the samples from X with label 0 or 1. Formally, we create
the following dataset:
X_{0|1} = \{x_i \mid x_i \in X \wedge y_i \in \{0, 1\}\}   (2.47)


Fig. 2.20 A sample dataset including four different classes. Each class is shown using a unique color and shape

and a binary classifier is fitted on X0|1. Similarly, X0|2, X0|3, X1|2, X1|3 and X2|3
are created and a separate binary classifier is fitted on each of them. This way,
there will be six binary classifiers. In order to classify the new input xq into one of the
four classes, it is first classified using each of these six classifiers. We know that each
classifier will yield an integer number between 0 and 3. Since there are six classifiers,
one of the integer numbers will be repeated more than the others. The class of xq is equal
to the number with the highest occurrence. From another perspective, we can think of
the output of each binary classifier as a vote. Then, the winner class is the one with the
majority of votes. This method of classification is called majority voting. Figure 2.21
shows the six binary classifiers trained on the six pairs of classes mentioned above. Besides,
it illustrates how points on the plane are classified into one of the four classes using this
technique.
This example can be easily extended to a multiclass classification problem with N
classes. More specifically, all pairs of classes Xa|b are generated for a = 1 . . . N −
1 and b = a + 1 . . . N. Then, a binary model fa|b is fitted on the corresponding
dataset. This way, N(N − 1)/2 binary classifiers will be trained. Finally, an unseen
sample xq is classified by computing the majority of the votes produced by all the binary
classifiers.
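A compact sketch of the one versus one scheme with majority voting is shown below. The helper train_linear is a hypothetical stand-in for any of the binary training procedures discussed in this chapter (a least-squares fit is used here only to keep the example short).

import numpy as np
from itertools import combinations

def train_linear(X, y):
    """Placeholder binary trainer returning a weight vector for labels in {-1, +1}.
    Any of the loss functions discussed above could be plugged in instead."""
    Xb = np.hstack([X, np.ones((len(X), 1))])          # append a bias column
    return np.linalg.lstsq(Xb, y, rcond=None)[0]       # least squares, for brevity only

def ovo_fit(X, y, n_classes):
    models = {}
    for a, b in combinations(range(n_classes), 2):     # all unordered class pairs
        mask = (y == a) | (y == b)
        labels = np.where(y[mask] == a, -1.0, 1.0)     # a -> -1, b -> +1
        models[(a, b)] = train_linear(X[mask], labels)
    return models

def ovo_predict(models, x, n_classes):
    votes = np.zeros(n_classes)
    xb = np.append(x, 1.0)
    for (a, b), w in models.items():
        votes[b if xb @ w > 0 else a] += 1             # each classifier casts one vote
    return int(np.argmax(votes))                       # majority voting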
One obvious problem of the one versus one technique is that the number of binary
classifiers increases quadratically with the number of classes in a dataset. This
means that using this technique we need to train 31,125 binary classifiers for a
250-class classification problem such as traffic sign classification. This makes the
one versus one approach impractical for large values of N. In addition, the one versus one technique can sometimes generate ambiguous results. This may
happen when two or more classes are tied for the majority of votes. For example,
2.3 Multiclass Classification

43

44

2 Pattern Classification

Fig. 2.21 Training six classifiers on the four-class classification problem. The one versus one technique considers all unordered pairs of classes in the dataset and fits a separate binary classifier on each pair. An input x is classified by computing the majority of the votes produced by the binary classifiers. The bottom plot shows how every point on the plane is classified into one of the four classes

assume that the votes of 6 classifiers in the above example for an unseen sample
are 1, 1, 2, and 2 for classes 0, 1, 2, and 3, respectively. In this case, the Class 2 and
Class 3 have equally the majority votes. Consequently, the unseen sample cannot be
classified. This problem might be addressed by taking into account the classification
score (i.e., wxT ) produced by the binary classifiers. However, the fact remains that
one versus one approach is not practical in applications with many classes.

2.3.2 One Versus Rest
Another popular approach for building a multiclass classifier using a group of binary
classifiers is called one versus rest (OVR). It may also be called the one versus all or one
against all approach. As opposed to the one versus one approach, where N(N − 1)/2 binary
classifiers are created for an N-class classification problem, the one versus rest approach
trains only N binary classifiers to make predictions. The main difference between
these two approaches is the way that they create the binary datasets.
In the one versus rest technique, a binary dataset for class a is created as follows:

X_{a|rest} = \{(x_i, 1) \mid x_i \in X \wedge y_i = a\} \cup \{(x_i, -1) \mid x_i \in X \wedge y_i \neq a\}.   (2.48)

Literally, X_{a|rest} is composed of all the samples in X. The only difference is
the label of the samples. For creating X_{a|rest}, we pick all the samples in X with label
a and add them to X_{a|rest} after changing their label to 1. Then, the label of all the
remaining samples in X is changed to −1 and they are added to X_{a|rest}. For an
N-class classification problem, X_{a|rest} is generated for all a = 1 . . . N. Finally, a
binary classifier f_{a|rest}(x) is trained on each X_{a|rest} using the methods we previously
mentioned in this chapter. An unseen sample xq is classified by computing:
\hat{y}_q = \arg\max_{a=1...N} f_{a|rest}(x_q).   (2.49)

In other words, the scores of all the classifiers are computed. The classifier with
the maximum score determines the class of the sample xq. We applied this technique
on the dataset shown in Fig. 2.20. Figure 2.22 illustrates how the binary datasets
are generated. It also shows how every point on the plane is classified using this
technique.
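The one versus rest scheme can be sketched in a few lines; the snippet below (illustrative only) reuses a binary trainer such as the hypothetical train_linear from the previous sketch, passed in as an argument, and applies (2.49) at prediction time.

import numpy as np

def ovr_fit(X, y, n_classes, train_linear):
    """One versus rest: one binary problem per class, with labels +1 for class a
    and -1 for the rest. train_linear is any binary trainer returning a weight vector."""
    return [train_linear(X, np.where(y == a, 1.0, -1.0)) for a in range(n_classes)]

def ovr_predict(models, x):
    xb = np.append(x, 1.0)
    scores = [xb @ w for w in models]      # classification score of every binary model
    return int(np.argmax(scores))          # Eq. (2.49): the highest-scoring class wins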
Comparing the results from one versus one and one versus all, we observe that
they are not identical. One advantage of one versus rest over one versus one approach
is that the number of binary classifiers increases linearly with the number of classes.


Fig. 2.22 The one versus rest approach creates a binary dataset by changing the label of the class-of-interest to 1 and the label of the other classes to −1. Creating binary datasets is repeated for all classes. Then, a binary classifier is trained on each of these datasets. An unseen sample is classified based on the classification scores of the binary classifiers

For this reason, the one versus rest approach is practical even when the number of classes
is high. However, it possesses another issue, which is called the imbalanced dataset problem.
We will talk thoroughly about imbalanced datasets later in this book. But, to give an
insight into this problem, consider a 250-class classification problem where each
class contains 1000 training samples. This means that the training dataset contains
250,000 samples. Consequently, X_{a|rest} will contain 1000 samples with label 1


(positive samples) and 249,000 samples with label −1 (negative samples). We know
from the previous section that a binary classifier is trained by minimizing a loss function.
However, because the number of negative samples is 249 times more than the number of positive samples, the optimization algorithm will in fact try to minimize the loss incurred
by the negative samples. As a result, the binary model might be highly biased
toward the negative samples and it might classify most unseen positive samples as
negative. For this reason, the one versus rest approach usually requires a solution
for dealing with the highly imbalanced datasets X_{a|rest}.

2.3.3 Multiclass Hinge Loss
An alternative solution to one versus one and one versus all techniques is to partition
the d-dimensional space into N distinct regions using N linear models such that:
L_{0/1}(W) = \sum_{i=1}^{n} H(x_i, y_i), \qquad H(x_i, y_i) = \begin{cases} 0 & y_i = \arg\max_{j=1...N} f_j(x_i) \\ 1 & \text{otherwise} \end{cases}   (2.50)

is minimum over all the samples in the training dataset. In this equation, W ∈ R^{N×(d+1)}
is a weight matrix holding the weights (d weights for each linear model) and biases
(one bias for each linear model) of the N linear models. Also, xi ∈ R^d is defined as before
and yi ∈ {1, . . . , N} can take any of the categorical integer values between 1 and N,
indicating the class of xi. This loss function is in fact the generalization of the
0/1 loss function to N classes. Here also, the objective of the above loss function is
to minimize the number of incorrectly classified samples. After finding the optimal
weight matrix W*, an unseen sample xq is classified using:
\hat{y}_q = \arg\max_{i=1...N} f_i(x_q; W_i^*)   (2.51)

where W_i^* depicts the i-th row of the weight matrix. The weight matrix W^* might
be found by minimizing the above loss function. However, optimizing this function
using iterative gradient methods is a hard task. Based on the above equation, a
sample x_i belonging to class c is classified correctly if:
\forall_{j=1...N \wedge j \neq c} \quad W_c x_i > W_j x_i.   (2.52)

In other words, the score of the c-th model must be greater than the scores of all the other
models so that x_i is classified correctly. By rearranging the above equation, we will obtain:
\forall_{j=1...N \wedge j \neq c} \quad W_j x_i - W_c x_i \leq 0.   (2.53)


Assume that W j xi is fixed. As Wc xi increases, their difference becomes more negative. In contrast, if the sample is classified incorrectly, their difference will be greater
than zero. Consequently, if:
\max_{j=1...N \wedge j \neq c} W_j x_i - W_c x_i   (2.54)

is negative, the sample is classified correctly. On the contrary, if it is positive, the sample
is misclassified. In order to increase the stability of the models, we can define the
margin ε ∈ R+ and rewrite the above equation as follows:
H(x_i) = \varepsilon + \max_{j=1...N \wedge j \neq c} W_j x_i - W_c x_i.   (2.55)

The sample is classified correctly if H (xi ) is negative. The margin variable ε eliminates the samples which are very close to the model. Based on this equation, we can
define the following loss function:
L(W) = \sum_{i=1}^{n} \max(0, \varepsilon + \max_{j \neq c} W_j x_i - W_c x_i).   (2.56)

This loss function is called the multiclass hinge loss. If the sample is classified correctly
and it is outside the critical region, ε + max_{j≠c} W_j x_i − W_c x_i will be negative. Hence, the output of max(0, ·) will be zero, indicating that we have not made a
loss on x_i using the current value of W. Nonetheless, if the sample is classified incorrectly or it is within the critical region, ε + max_{j≠c} W_j x_i − W_c x_i will
be a positive number. As a result, max(0, ·) will be positive, indicating that we
have made a loss on x_i. By minimizing the above loss function, we will find W such
that the number of misclassified samples is minimum.
Similar to the binary hinge loss, the multiclass hinge loss is differentiable everywhere except at the hinge points, so gradient-based optimization methods such as gradient descent can still be used to find
the minimum of this function. To achieve this goal, we have to find the partial
derivatives of the loss function with respect to each of the parameters in W. Given a
sample x_i and its corresponding label y_i, the partial derivative of (2.56) with respect to
W_{m,n} is calculated as follows:
\frac{\delta L(W; (x_i, y_i))}{\delta W_{m,n}} = \begin{cases} x_n & \varepsilon + W_m x_i - W_{y_i} x_i > 0 \ \text{and} \ m = \arg\max_{p \neq y_i} W_p x_i - W_{y_i} x_i \\ -x_n & \varepsilon + \max_{p \neq m} W_p x_i - W_m x_i > 0 \ \text{and} \ m = y_i \\ 0 & \text{otherwise} \end{cases}   (2.57)

\frac{\delta L(W)}{\delta W_{m,n}} = \sum_{i=1}^{n} \frac{\delta L(W; (x_i, y_i))}{\delta W_{m,n}}   (2.58)

In these equations, Wm,n depicts the n th parameter of the m th model. Similar to
the binary hinge loss, ε can be set to 1. In this case, the magnitude of the models
will be adjusted such that the loss function is minimum. If we plug the above partial


derivatives into the gradient descent method and apply it to the dataset illustrated
in Fig. 2.20, we will obtain the result shown in Fig. 2.23.

Fig. 2.23 A two-dimensional space divided into four regions using four linear models fitted using the multiclass hinge loss function. The plot on the right shows the linear models (lines in the two-dimensional case) in the space
The left plot in this figure shows how the two-dimensional space is divided into
four distinct regions using the four linear models. The plot on the right also illustrates
the four lines in this space. It should be noted that it is the maximum score of a sample
from all the models that determines the class of the sample.
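The per-sample loss (2.56) and gradient (2.57) can be written compactly. The sketch below (a minimal illustration with ε = 1 by default, not the book's code) returns the loss and the gradient matrix for a single sample.

import numpy as np

def multiclass_hinge_grad(W, x, y, eps=1.0):
    """Loss (2.56) and gradient (2.57) for one sample.

    W is (N, d+1); x is a (d+1)-vector whose last entry is 1 (bias); y is the
    true class index.
    """
    scores = W @ x
    margins = eps + scores - scores[y]
    margins[y] = -np.inf                  # exclude the true class from the max in (2.56)
    j = int(np.argmax(margins))           # the most violating other class
    loss = max(0.0, margins[j])
    grad = np.zeros_like(W)
    if loss > 0:                          # the loss is active for this sample
        grad[j] += x                      # push the wrong class score down
        grad[y] -= x                      # push the true class score up
    return loss, grad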

2.3.4 Multinomial Logistic Function
In the case of binary classification problems, we are able to model the probability
of x using the logistic function in (2.33). Then, a linear model can be found by
maximizing the joint probability of training samples. Alternatively, we showed in
(2.46) that we can minimize the negative of logarithm of probabilities to find a linear
model for a binary classification problem.
It is possible to extend the logistic function into a multiclass classification problem.
We saw before that N classes can be discriminated using N different lines. In addition,
we showed how to model the posterior probability of input x using logistic regression
in (2.33). Instead of modeling p(y = 1|x; w), we can alternatively model ln p(y =
1|x; w) given by:
\ln p(y = 1|x; w) = w x^T - \ln Z   (2.59)

where ln Z is a normalization factor. This model is called log-linear model. Using this
formulation, we can model the posterior probability of N classes using N log-linear


models:
\ln p(y = 1|x; w_1) = w_1 x^T - \ln Z
\ln p(y = 2|x; w_2) = w_2 x^T - \ln Z
\quad\vdots
\ln p(y = N|x; w_N) = w_N x^T - \ln Z   (2.60)
If we compute the exponential of the above equations we will obtain:
p(y = 1|x; w_1) = \frac{e^{w_1 x^T}}{Z}
p(y = 2|x; w_2) = \frac{e^{w_2 x^T}}{Z}
\quad\vdots
p(y = N|x; w_N) = \frac{e^{w_N x^T}}{Z}   (2.61)

We know from probability theory that:
\sum_{c=1}^{N} p(y = c|x; w_c) = 1   (2.62)

Using this property, we can find the normalization factor Z that satisfies the above
condition. If we set:
\frac{e^{w_1 x^T}}{Z} + \frac{e^{w_2 x^T}}{Z} + \cdots + \frac{e^{w_N x^T}}{Z} = 1   (2.63)

and solve the above equation for Z, we will obtain:
Z = \sum_{i=1}^{N} e^{w_i x^T}   (2.64)

Using the above normalization factor and given the sample xi and its true class c,
the posterior probability p(y = c|xi ) is computed by:
p(y = c|x_i) = \frac{e^{w_c x_i^T}}{\sum_{j=1}^{N} e^{w_j x_i^T}}   (2.65)

where N is the number of classes. The denominator in the above equation is a
normalization factor so that \sum_{c=1}^{N} p(y = c|x_i) = 1 holds true and, consequently, p(y =
c|x_i) is a valid probability function. The above function, which is called the softmax
function, is commonly used to train convolutional neural networks. Given a dataset


of d-dimensional samples xi with their corresponding labels yi ∈ {1, . . . , N} and
assuming independence between the samples (see Sect. 2.2.3), the likelihood
of all the samples for a fixed W can be written as follows:
p(X) = \prod_{i=1}^{n} p(y = y_i|x_i).   (2.66)

As before, instead of maximizing the likelihood, we can minimize the negative of
log-likelihood that is defined as follows:
-\log(p(X)) = -\sum_{i=1}^{n} \log(p(y = y_i|x_i)).   (2.67)

Note that the product operator has changed to the summation operator taking into
account the fact that log(ab) = log(a) + log(b). Now, for any W we can compute
the following loss:
L_{softmax}(W) = -\sum_{i=1}^{n} \log(y_c)   (2.68)

where W ∈ R^{N×(d+1)} represents the parameters of the N linear models and yc = p(y =
yi|xi). Before computing the partial derivatives of the above loss function, we explain
how to represent the above loss function using a computational graph. Consider computing
log(yc) for a single sample. This can be represented using the graph in Fig. 2.24.
A computational graph is a directed acyclic graph where each non-leaf node
shows a computational unit which accepts one or more inputs. The leaves
show the inputs of the graph. The computation starts from the leaves and follows the
direction of the edges until it reaches the final node. We can compute the gradient
of each computational node with respect to its inputs. The label next to each edge
shows the gradient of its child node (top) with respect to its parent node (bottom).
Assume we want to compute δL/δW1. To this end, we have to sum over all the paths
from L to W1 and multiply the gradients represented by the edges along each path. This
is equivalent to the multivariate chain rule. According to this, δL/δW1 will
be equal to:
\frac{\delta L}{\delta W_1} = \frac{\delta L}{\delta y_c} \frac{\delta y_c}{\delta z_1} \frac{\delta z_1}{\delta W_1}.   (2.69)

Using this concept, we can easily compute δL/δW_{i,j} where W_{i,j} refers to the j-th
parameter of the i-th model. For this purpose, we need to compute δy_c/δz_i, which is done as
follows:
\frac{\delta y_c}{\delta z_i} = \frac{\delta}{\delta z_i}\left(\frac{e^{z_c}}{\sum_{m=1}^{N} e^{z_m}}\right) = \begin{cases} \frac{e^{z_c}\sum_m e^{z_m} - e^{z_c}e^{z_c}}{(\sum_m e^{z_m})^2} = y_c(1 - y_c) & i = c \\ \frac{-e^{z_i}e^{z_c}}{(\sum_m e^{z_m})^2} = -y_i y_c & i \neq c \end{cases}   (2.70)


Fig. 2.24 Computational graph of the softmax loss on one sample

Now, we can compute δ L /δWi, j by plugging the above derivative into the chain
rule obtained by the computational graph for sample x with label yc .

\frac{\delta L}{\delta W_{i,j}} = \begin{cases} -(1 - y_c) x_j & i = c \\ y_i x_j & i \neq c \end{cases}   (2.71)
With this formulation, the gradient of all the samples will be equal to the sum of the
gradients of the individual samples. Now, it is possible to minimize the softmax loss function
using the gradient descent method. Figure 2.25 shows how the two-dimensional
space in our example is divided into four regions using the models trained by the
softmax loss function. Comparing the results from one versus one, one versus all,
the multiclass hinge loss and the softmax loss, we realize that their results are not
identical. However, the two former techniques are not usually used for multiclass
classification problems because of the reasons we mentioned earlier. Also, there is
no practical rule of thumb to tell whether the multiclass hinge loss is better or worse than
the softmax loss function.
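The softmax loss and the gradient in (2.71) translate directly into vectorized NumPy. The sketch below (illustrative only, with the loss averaged over the samples) also subtracts the per-sample maximum score before exponentiation, a standard trick for numerical stability.

import numpy as np

def softmax_loss_grad(W, X, y):
    """Softmax (cross-entropy) loss (2.68) and its gradient (2.71) for N linear models.

    W is (N, d+1); X is (n, d+1) with a trailing column of ones; y holds class
    indices in {0, ..., N-1}. A minimal sketch, not the book's implementation.
    """
    scores = X @ W.T                                   # z_i = W_i x^T for every sample and class
    scores -= scores.max(axis=1, keepdims=True)        # subtract the max for numerical stability
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)                  # Eq. (2.65)
    n = len(y)
    loss = -np.mean(np.log(p[np.arange(n), y]))
    dscores = p.copy()                                 # y_i x_j for i != c ...
    dscores[np.arange(n), y] -= 1.0                    # ... and -(1 - y_c) x_j for i == c
    grad = dscores.T @ X / n
    return loss, grad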

2.4 Feature Extraction
In practice, it is very likely that samples in the training set X = {(x1 , y1 ), . . . ,
(xn , yn )} are not linearly separable. The multiclass dataset in the previous section is


an example of such a dataset. Figure 2.26 shows a nonlinear dataset and the linear
classifier fitted using logistic regression. Samples of each class are illustrated using
a different marker and a different color.

Fig. 2.25 The two-dimensional space divided into four regions using four linear models fitted using the softmax loss function. The plot on the right shows the linear models (lines in the two-dimensional case) in the space

Fig. 2.26 A linear classifier is not able to accurately discriminate the samples in a nonlinear dataset
Clearly, it is impossible to perfectly discriminate these two classes using a line.
There are mainly two solutions for solving this problem. The first solution is to train
a nonlinear classifier such as random forest on the training dataset. This method is
not within the scope of this book. The second method is to project the original data
into another space using the transformation function Φ : Rd → Rd̂ where classes
are linearly separable in the transformed space. Here, d̂ can be any arbitrary integer
number. Formally, given the sample x ∈ Rd , it is transformed into a d̂-dimensional

space using:

\Phi(x) = \hat{x} = \begin{bmatrix} \phi_1(x) \\ \phi_2(x) \\ \vdots \\ \phi_{\hat{d}}(x) \end{bmatrix}   (2.72)

where φi : R^d → R is a scalar function which accepts a d-dimensional input and
returns a scalar. Also, φi can be any function. Sometimes, an expert can design these
functions based on the requirements of the problem. To transform the above nonlinear
dataset, we define Φ(x) as follows:


\Phi(x) = \hat{x} = \begin{bmatrix} \phi_1(x) = e^{-10\|x - c_1\|^2} \\ \phi_2(x) = e^{-20\|x - c_2\|^2} \end{bmatrix}   (2.73)
where c1 = (0.56, 0.67) and c2 = (0.19, 0.11). By applying this function on each
sample, we will obtain a new two-dimensional space where the samples are nonlinearly transformed. Figure 2.27 shows how samples are projected into the new
two-dimensional space. It is clear that the samples in the new space become linearly separable. In other words, the dataset Xˆ = {(Φ(x1 ), y1 ), . . . , (Φ(xn ), yn )} is
linearly separable. Consequently, the samples in X̂ can be classified using any of the linear
classifiers described in the previous sections. Figure 2.28 shows a linear classifier fitted on the
data in the new space.
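The transformation in (2.73) amounts to measuring radial distances to two centers. A short sketch using the centers and scales from (2.73) is shown below; the random samples are only there to exercise the function.

import numpy as np

def phi(X, centers, gammas):
    """Radial transform in the spirit of (2.73): one feature per center,
    phi_i(x) = exp(-gamma_i * ||x - c_i||^2)."""
    feats = [np.exp(-g * np.sum((X - c) ** 2, axis=1)) for c, g in zip(centers, gammas)]
    return np.stack(feats, axis=1)

centers = np.array([[0.56, 0.67], [0.19, 0.11]])   # c1 and c2 from (2.73)
gammas = [10.0, 20.0]                              # the scales 10 and 20 from (2.73)
X = np.random.default_rng(3).random((5, 2))        # a few random 2-D samples
print(phi(X, centers, gammas))                     # their coordinates in the new space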
The decision boundary of a linear classifier is a hyperplane (a line in this example). However, because Φ(x) is a nonlinear transformation, if we apply the inverse
transform from the new space to the original space, the decision boundary will not
be a hyperplane anymore. Instead, it will be a nonlinear decision boundary. This is
illustrated in the right plot of Fig. 2.28.
The choice of Φ(x) is the most important step in transforming samples into a new
space where they are linearly separable. In the case of high-dimensional vectors such
as images, finding an appropriate Φ(x) becomes even harder. In some cases, Φ(x)
might be a composition of multiple functions. For example, one can define Φ(x) =
Ψ(Ω(Γ(x))) where Φ : R^d → R^{d̂}, Ψ : R^{d_2} → R^{d̂}, Ω : R^{d_1} → R^{d_2} and Γ : R^d →
R^{d_1}. In practice, there might be an infinite number of functions that make the samples linearly
separable.
Let us apply our discussion so far to a real-world problem. Consider the 43
classes of traffic signs shown in Fig. 2.29 that are obtained from the German traffic
sign recognition benchmark (GTSRB) dataset. For the purpose of this example,
we randomly picked 1500 images for each class. Assume a 50 × 50 RGB image.
Taking into account the fact that each pixel in this image is represented by a three-dimensional vector, the flattened image will be a 50 × 50 × 3 = 7500 dimensional
vector. Therefore, the training dataset X is composed of 1500 training sample pairs
(xi, yi) per class, where xi ∈ R^7500 and yi ∈ {0, . . . , 42}.
Besides the training dataset, we also randomly pick 6400 test samples (ẋi, ẏi) from
the dataset that are not included in X. Formally, we have another dataset X˙ of


Fig. 2.27 Transforming samples from the original space (left) into another space (right) by applying
Φ(x) on each sample. The bottom colormaps show how the original space is transformed using this
function

traffic signs where ẋi ∈ R^7500, ẋi ∉ X and ẏi ∈ {0, . . . , 42}. It is very important
in testing a model to use unseen samples. We will explain this topic thoroughly in the
next chapters. Finally, we can train a linear classifier F(x) using X to discriminate
the 43 classes of traffic signs. Then, F(x) can be tested using X˙ by computing the
classification accuracy.
To be more specific, we pick every sample ẋi and predict its class label using F(ẋi).
Recall from the previous sections that for a softmax model with 43 linear models, the class
of sample ẋi is computed using F(ẋi) = arg max_{i=1...43} f_i(ẋi), where f_i(ẋi) = w_i ẋ_i^T
is the score computed by the i-th model. With this formulation, the classification
accuracy on the test samples is obtained by computing:
acc = \frac{1}{6400} \sum_{i=1}^{6400} 1[F(\dot{x}_i) == \dot{y}_i]   (2.74)

2.4 Feature Extraction

55

Fig. 2.28 Samples become linearly separable in the new space. As the result, a linear classifier is
able to accurately discriminate these samples. If we transform the linear model from the new space
into the original space, the linear decision boundary become a nonlinear boundary

Fig. 2.29 43 classes of traffic in obtained from the GTSRB dataset (Stallkamp et al. 2012)

where 1[.] is the indicator function and it returns 1 when the input is true. The quantity
acc is equal to 1 when all the samples are classified correctly and it is equal to 0
when all of them are misclassified. We trained a linear model on this dataset using
the raw pixel values. The accuracy on the test set is equal to 73.17%. If we ignore the
intercept, the parameter vector w ∈ R^7500 of the linear model f(x) = wx^T has the
same dimension as the input image. One way to visualize and study the parameter
vector is to reshape w into a 50 × 50 × 3 image. Then, we can plot each channel in
this three-dimensional array using a colormap plot. Figure 2.30 shows weights of the
model related to Class 1 after reshaping.
We can analyze this figure to see what a linear model trained on raw pixel intensities actually learns. Consider the linear model f(x) = w1x1 + · · · + wnxn without
the intercept term. Taking into account the fact that pixel intensities in a regular RGB
image are positive values, xi in this equation is always a positive value. Therefore,
f(x) will return a higher value if wi is a large positive number. On the contrary, f(x) will
return a smaller value if wi is a negative number with a large magnitude. From another perspective, we can interpret positive weights as “likes” and negative weights as “dislikes”
of the linear model.
That being said, if wi is negative, the model does not like high values of xi. Hence,
if the intensity of the pixel at xi is higher than zero, it will reduce the classification score.


Fig. 2.30 Weights of a linear model trained directly on raw pixel intensities can be visualized by
reshaping the vectors so they have the same shape as the input image. Then, each channel of the
reshaped matrix can be shown using a colormap

In contrast, if wi is positive, the model likes high values of xi . In other words, as the
intensity of xi increases, the model becomes more confident about the classification
since it increases the classification score.
Looking at this figure, we see a red region in the middle of the red, green and blue
channels. According to the colormap next to each plot, red regions correspond to
weights with high positive values. Since the same region is red in all three channels,
we can infer that the model likes to see the white color in that specific region. Then,
we observe that the region analogous to the rim of the sign has high positive weights
in the red channel and small negative weights in the blue channel. Also, the weights
of the green channel for that region are close to zero. This means that the model likes
to see high red values in that region and it dislikes blue values in that region. This
choice made by the model also seems rational to a human expert. This argument
can be applied to the other classes of traffic signs as well.
Remember that the accuracy of the model trained on raw pixel intensities was equal
to 73.17%. Now, the question is why the accuracy of the model is so low. To answer
this question, we start with a basic concept. A two-dimensional vector (x1, x2) can be
illustrated using a point in a two-dimensional space. Moreover, a three-dimensional
vector (x1, x2, x3) can be shown using a point in a three-dimensional space. Similarly,
a d-dimensional vector (x1, . . . , xd) is a point in a d-dimensional space. It is trivial
for a human to imagine points in two-dimensional and three-dimensional spaces.


But, it might be difficult at first to imagine higher dimensions. For a start, it suffices
to know that a d-dimensional vector is a point in a d-dimensional space.
Each RGB image in the above example will be a point in a 7500-dimensional space.
We can study the above question in this space. There are mainly two possibilities that
reduce the accuracy of a linear model in this space defined by raw images. First, like
the dataset in Fig. 2.26, the classes of traffic signs might be completely disjoint but
not linearly separable. Second, similar to the dataset in Fig. 2.20, the
classes might overlap with each other. The latter problem is commonly known
as interclass similarity, meaning that samples of two or more classes are similar. In
both cases, a linear model is not able to accurately discriminate the classes.
Although there might not be a quick remedy to the second problem, the first
problem might be addressed by transforming the raw vectors into another space
using the feature transformation function Φ(x). Knowing that the output of
Φ(x) is a d̂-dimensional vector, the first question in designing Φ(x) is what the
value of d̂ should be. Even if we found a way to determine the value of d̂, the next question is
what the transformation functions φi(x), i = 1, . . . , d̂ should be. There are infinitely many
ways to define these functions. For this reason, it is not trivial in practice to define Φ(x)
for an image (it might not be a tedious task for other modalities with low dimensions).
To alleviate this problem, researchers came up with the idea of feature extraction
algorithms. In general, a feature extraction algorithm processes an image and generates a more informative vector which better separates the classes. Notwithstanding,
a feature extraction algorithm does not guarantee that the classes will be linearly
separable. Despite this, in most cases, a feature extraction algorithm is applied to an image
before feeding it to a classifier. In other words, we do not classify images using raw
pixel values. Instead, we always extract their features and train a classifier on top of
the feature vectors.
One of the widely used feature extraction algorithms is called histogram of oriented gradients (HOG). It starts by applying the gamma correction transformation
on the image and computing its first derivatives. Then, the image is divided into
small patches called cells. Within each cell, a histogram is computed based on the
orientation of the gradient vector and its magnitude using the pixels inside that cell.
Then, blocks are formed by considering neighboring cells and the histograms of the
cells within each block are concatenated. Finally, the feature vector is obtained by
concatenating the vectors of all blocks. The whole process of this algorithm can be
easily represented in terms of mathematical equations.
Assume that Φhog (x) : Rd → Rdhog denotes the HOG features. We can now
apply Φhog (x) on each sample of the training set X in order to obtain Xˆ =
{(Φhog (x1 ), y1 ), . . . , (Φhog (xn ), yn )}. Then, a linear classifier is trained using Xˆ .
By doing this, the accuracy of the classification increases to 88.90%. Comparing
with the accuracy of the classifier trained on raw pixel intensities (i.e., 73.17%), the
accuracy increases 15.73%.
There might be different reasons why the accuracy is still not very high. First, the
feature extraction function Φhog (x) might not be able to perfectly make the classes
linearly separable. This could be due to the fact that there are traffic signs such as


“left bend ahead” and “right bend ahead” with slight differences. The utilized feature
extraction function might not be able to effectively model these differences such that
these classes become linearly separable. Second, the function Φhog (x) may cause
some of the classes to overlap with other classes. One or both of these reasons
can be responsible for the low accuracy.
Like before, it is possible to create another function whose input is Φhog (x) and its
output is a d̂ dimensional vector. For example, we can define the following function:
\Phi(\Phi_{hog}(x)) = \begin{bmatrix} \phi_1(\Phi_{hog}(x)) \\ \phi_2(\Phi_{hog}(x)) \\ \vdots \\ \phi_{\hat{d}}(\Phi_{hog}(x)) \end{bmatrix} = \begin{bmatrix} e^{-\gamma \|\Phi_{hog}(x) - c_1\|^2} \\ e^{-\gamma \|\Phi_{hog}(x) - c_2\|^2} \\ \vdots \\ e^{-\gamma \|\Phi_{hog}(x) - c_{\hat{d}}\|^2} \end{bmatrix}   (2.75)
where γ ∈ R is a scaling constant and ci ∈ R^{d_{hog}} are parameters which can be
defined manually or automatically. Doing so, we can generate a new dataset
X̂ = {(Φ(Φhog(x1)), y1), . . . , (Φ(Φhog(xn)), yn)} and train a linear classifier on
top of this dataset. This increases the accuracy from 88.90 to 92.34%. Although the
accuracy is higher, it is still not high enough to be used in practical applications.
One may add another feature transformation whose input is Φ(Φhog(x)). In fact,
composing transformation functions can be done several times. But, this does
not guarantee that the classes are going to be linearly separable. Some of the transformation functions may increase the interclass overlap, causing a drop in accuracy.
As it turns out, the key to accurate classification is to have a feature transformation
function Φ(x) which is able to make the classes linearly separable without causing
interclass overlap. But, how can we find a Φ(x) which satisfies both of these conditions?
We saw in this chapter that a classifier can be directly trained on the training dataset.
It might also be possible to learn Φ(x) using the same training dataset. If Φ(x) is
designed by a human expert (such as the HOG features), it is called a hand-crafted
or hand-engineered feature function.

2.5 Learning Φ(x)
Despite the fairly accurate results obtained by hand-crafted features on some datasets,
as we will show in the next chapters, the best results have been achieved by learning
Φ(x) from a training set. In the previous section, we designed a feature function
to make the classes in Fig. 2.26 linearly separable. However, designing that feature
function by hand was a tedious task and needed many trials. Note that, the dataset
shown in that figure was composed of two-dimensional vectors. Considering the
fact that a dataset may contain high-dimensional vectors in real-world applications,
designing an accurate feature transformation function Φ(x) becomes even harder.
For this reason, in many cases the better approach is to learn Φ(x) from data.
More specifically, Φ(x; wφ ) is formulated using the parameter vector wφ . Then, the


linear classifier for the i-th class is defined as:
f_i(x) = w \Phi(x; w_\phi)^T   (2.76)

where w ∈ R^{d̂} and wφ are parameter vectors that are found using the training data.
Depending on the formulation of Φ(x), wφ can be a vector of arbitrary size.
The parameter vectors w and wφ determine the weights of the linear classifier and the
transformation function, respectively. The ultimate goal in a classification problem
is to jointly learn these parameter vectors such that the classification accuracy is high.
This goal is exactly the same as learning w such that wx^T accurately classifies
the samples. Therefore, we can use the same loss functions in order to train both
parameter vectors in (2.76). Assume that Φ(x; wφ) is defined as follows:
\Phi(x; w_\phi) = \begin{bmatrix} \ln(1 + e^{(w_{11} x_1 + w_{21} x_2 + w_{01})}) \\ \ln(1 + e^{(w_{12} x_1 + w_{22} x_2 + w_{02})}) \end{bmatrix}   (2.77)

In the above equation wφ = {w11 , w21 , w01 , w12 , w22 , w02 } is the parameter vector
for the feature transformation function. Knowing the fact that the dataset in Fig. 2.26
is composed of two classes, we can minimize the binary logistic loss function for
jointly finding w and wφ . Formally, the loss function is defined as follows:
L(w, w_\phi) = -\sum_{i=1}^{n} y_i \log(\sigma(w \Phi(x_i)^T)) + (1 - y_i) \log(1 - \sigma(w \Phi(x_i)^T))   (2.78)

The intuitive way to understand the above loss function and compute its gradient
is to build its computational graph. This is illustrated in Fig. 2.31. In the graph, g(z) =
ln(1 + e^z) is a nonlinear function which is responsible for nonlinearly transforming
the space. First, the dot product of the input vector x is computed with the two weight
vectors w_1^{L0} and w_2^{L0} in order to obtain z_1^{L0} and z_2^{L0}, respectively. Then, each of
these values is passed through the nonlinear function and their dot product with w^{L2}
is calculated. Finally, this score is passed through a sigmoid function and the loss
is computed in the final node. In order to minimize the loss function (i.e., the top
node in the graph), the gradient of the loss function has to be computed with respect
to the nodes indicated by w in the figure. This can be done using the chain rule of
derivatives. To this end, the gradient of each node with respect to its parent must be
computed. Then, for example, to compute δL/δw_1^{L0}, we have to sum over all the paths
from w_1^{L0} to L and multiply the terms along each path. Since there is only one
path from w_1^{L0} in this graph, the gradient will be equal to:
\frac{\delta L}{\delta w_1^{L_0}} = \frac{\delta z_1^{L_0}}{\delta w_1^{L_0}} \frac{\delta z_1^{L_1}}{\delta z_1^{L_0}} \frac{\delta z^{L_2}}{\delta z_1^{L_1}} \frac{\delta p}{\delta z^{L_2}} \frac{\delta L}{\delta p}   (2.79)
The gradient of the loss with respect to the other parameters can be obtained in a similar way. After that, we only need to plug the gradient vector into the gradient descent


method and minimize the loss function.

Fig. 2.31 Computational graph for (2.78). Gradient of each node with respect to its parent is shown on the edges

Figure 2.32 illustrates how the system eventually learns to transform and classify the samples. According to the plots in the
second and third rows, the model is able to find a transformation where the classes
become linearly separable. Then, classification of the samples is done in this space.
This means that the decision boundary in the transformed space is a hyperplane.
If we apply the inverse transform from the feature space to the original space, the
hyperplane is no longer a line. Instead, it is a nonlinear boundary which accurately
discriminates the classes.
In this example, the nonlinear transformation function that we used in (2.77) is
called the softplus function and it is defined as g(x) = ln(1 + e^x). The derivative of
this function is equal to g'(x) = 1/(1 + e^{-x}). The softplus function can be replaced
with another function whose input is a scalar and whose output is a real number. Also,
there are many other ways to define a transformation function and find its parameters
by minimizing the loss function.
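A minimal NumPy sketch of this joint training is shown below: two softplus feature units as in (2.77) feed a logistic classifier, and both parameter sets are updated by gradient descent on (2.78). It illustrates the idea under these assumptions and is not the code used to produce Fig. 2.32.

import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Wphi, bphi, w, b):
    """Two softplus features as in (2.77), followed by a logistic classifier (2.76)."""
    z = x @ Wphi + bphi          # pre-activations of the two feature units
    h = softplus(z)              # Phi(x; w_phi)
    p = sigmoid(h @ w + b)       # p(y = 1 | x)
    return z, h, p

def train(X, y, lr=0.1, iters=2000, seed=0):
    """Jointly learns w_phi and w by gradient descent on the logistic loss (2.78).
    X is (n, 2), y holds labels in {0, 1}; e.g. the dataset of Fig. 2.26."""
    rng = np.random.default_rng(seed)
    Wphi, bphi = rng.normal(0, 1, (2, 2)), np.zeros(2)
    w, b = rng.normal(0, 1, 2), 0.0
    n = len(y)
    for _ in range(iters):
        z, h, p = forward(X, Wphi, bphi, w, b)
        dscore = (p - y) / n                 # gradient at the classifier output
        dw, db = h.T @ dscore, dscore.sum()
        dh = np.outer(dscore, w)             # back through the linear classifier
        dz = dh * sigmoid(z)                 # softplus'(z) = sigmoid(z)
        Wphi -= lr * (X.T @ dz); bphi -= lr * dz.sum(axis=0)
        w -= lr * dw; b -= lr * db
    return Wphi, bphi, w, b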


Fig. 2.32 By minimizing (2.78) the model learns to jointly transform and classify the vectors. The
first row shows the distribution of the training samples in the two-dimensional space. The second
and third rows show the status of the model in three different iterations starting from the left plots

2.6 Artificial Neural Networks
The idea of learning a feature transformation function instead of designing it by
hand is very useful and it produces very accurate results in practice. However, as we
pointed out above, there are infinitely many ways to design a trainable feature transformation
function, and not all of them might be able to make the classes linearly separable
in the feature space. As a result, we need a more general way to design
trainable feature transformation functions.
An artificial neural network (ANN) is an interconnected group of smaller computational units called neurons and it tries to mimic biological neural networks. A detailed
discussion of biological neurons is not within the scope of this book. But, in order


Fig. 2.33 Simplified diagram of a biological neuron

to better understand an artificial neuron, we explain how a biological neuron works
in general. Figure 2.33 illustrates a simplified diagram of a biological neuron.
A neuron is mainly composed of dendrites, the soma, the axon, the nucleus, and
boutons. Boutons are also called axon terminals. Dendrites act as the
input of the neuron. They are connected either to a sensory input (such as the eye) or to
other neurons through synapses. The soma collects the inputs from the dendrites. When the
input passes a certain threshold, the neuron fires a series of spikes across the axon. As the
signal is fired, the nucleus returns to its stationary state. When it reaches this state,
the firing stops. The fired signals are transmitted to other neurons through the boutons.
Finally, synaptic connections transmit the signals from one neuron to another.
Depending on the synaptic strengths and the signal at one axon terminal, each
dendron (i.e., one branch of the dendrites) increases or decreases the potential of the nucleus.
Also, the direction of the signal is always from the axon terminals to the dendrites. That
means it is impossible to pass a signal from the dendrites to the axon terminals. In other
words, the path from one neuron to another is always a one-way path. It is worth
mentioning that each neuron might be connected to thousands of other neurons.
Mathematically, a biological neuron can be formulated as follows:
f(x) = G(w x^T + b).   (2.80)

In this equation, w ∈ R^d is the weight vector, x ∈ R^d is the input and b ∈ R is the
intercept term, which is also called the bias. Basically, an artificial neuron computes the
weighted sum of its inputs. This mimics the soma of the biological neuron. The synaptic
strength is modeled using w, and the inputs from other neurons or sensors are modeled
using x. In addition, G(x) : R → R is a nonlinear function which is called the activation
function. It accepts a real number and returns another real number after applying a
nonlinear transformation to it. The activation function acts as the threshold function
of the biological neuron. Depending on the potential of the nucleus (i.e., wx^T + b), the
activation function returns a real number. From a computational graph perspective, a
neuron is a node in the graph with the diagram illustrated in Fig. 2.34.
An artificial neural network is created by connecting one or more neurons to the
input. Each pair of neurons may or may not have a connection between them. With


Fig. 2.34 Diagram of an artificial neuron

Fig. 2.35 A feedforward neural network can be seen as a directed acyclic graph where the inputs
are passed through different layers until they reach the end

this formulation, the logistic regression model can be formulated using only one
neuron where G(x) is the sigmoid function in (2.33). Depending on how the neurons
are connected, a network acts differently. Among the various kinds of artificial neural
networks, feedforward neural networks (FNN) and recurrent neural networks (RNN)
are commonly used in the computer vision community.
The main difference between these two kinds of neural networks lies in the connections between their neurons. More specifically, in a feedforward neural network
the connections between neurons do not form a cycle. In contrast, in recurrent neural
networks the connections between neurons form a directed cycle. Convolutional neural
networks are a specific type of feedforward network. For this reason, in the remainder of this section we will only focus on feedforward networks. Figure 2.35 shows the
general architecture of feedforward neural networks.
A feedforward neural network includes one or more layers in which each layer contains one or more neurons. Also, the number of neurons in one layer can be different from another layer. The network in the figure has one input layer and three layers with computational neurons. Any layer between the input layer and the last layer is called a hidden layer. The last layer is also called the output layer. In this chapter, the input layer is denoted by I and the hidden layers are denoted by H_i, where i starts from 1. Moreover, the output layer is denoted by Z. In this figure, the first hidden layer has d1 neurons and the second hidden layer has d2 neurons. Also, the output layer has dz neurons.
It should be noted that every neuron in a hidden layer or the output layer is connected to all the neurons in the previous layer. That said, there are d1 × d2 connections between H1 and H2 in this figure. The connection from the ith input in the input layer to the jth neuron in H1 is denoted by w^1_{ij}. Likewise, the connection from the jth neuron in H1 to the kth neuron in H2 is denoted by w^2_{jk}. With this formulation, the weights connecting the input layer to H1 can be represented by W1 ∈ R^{d×d1}, where W1(i, j) shows the connection from the ith input to the jth neuron.
Finally, the activation function G of each neuron can be different from all other neurons. However, all the neurons in the same layer usually have the same activation function. Note that we have removed the bias connections in this figure to reduce the clutter. However, each neuron in every layer also has a bias term beside its weights. The bias terms of H1 are represented by b1 ∈ R^{d1}. Similarly, the biases of the hth layer are represented by bh. Using this notation, the network illustrated in this figure can be formulated as:

f(x) = G(G(G(xW1 + b1)W2 + b2)W3 + b3).        (2.81)
In terms of feature transformation, the hidden layers act as a feature transformation function which is a composite function. Then, the output layer acts as the linear classifier. In other words, the input vector x is transformed into a d1-dimensional space using the first hidden layer. Then, the transformed vectors are transformed into a d2-dimensional space using the second hidden layer. Finally, the output layer classifies the transformed d2-dimensional vectors.
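The composite transformation in (2.81) can be sketched in a few lines of NumPy. This is a rough illustration rather than the book's code; the layer sizes, the tanh activation, and the random parameters are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, d1, d2, dz = 10, 8, 6, 3                        # input, hidden, and output sizes

W1, b1 = rng.normal(size=(d, d1)), np.zeros(d1)
W2, b2 = rng.normal(size=(d1, d2)), np.zeros(d2)
W3, b3 = rng.normal(size=(d2, dz)), np.zeros(dz)
G = np.tanh                                        # activation of the layers

x = rng.normal(size=(1, d))                        # a single input vector
h1 = G(x @ W1 + b1)                                # transform to a d1-dimensional space
h2 = G(h1 @ W2 + b2)                               # transform to a d2-dimensional space
scores = G(h2 @ W3 + b3)                           # output layer, as in (2.81); in practice its
                                                   # activation is often the identity (see below)
print(scores.shape)                                # (1, 3)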
What makes a feedforward neural network very special is the fact that a feedforward network with one hidden layer and a finite number of neurons is a universal approximator. In other words, a feedforward network with one hidden layer can approximate any continuous function. This is an important property in classification problems.
Assume a multiclass classification problem where the classes are not linearly separable. Hence, we must find a transformation function which makes the classes linearly separable in the feature space. Suppose that Φideal(x) is a transformation function which is able to do this job perfectly. From a function perspective, Φideal(x) is a vector-valued continuous function. Since a feedforward neural network is a universal approximator, it is possible to design a feedforward neural network which accurately approximates Φideal(x). However, the beauty of feedforward networks is that we do not need to design such a function. We only need to determine the number of hidden layers, the number of neurons in each layer, and the type of activation functions. These are called hyperparameters. Among them, the first two hyperparameters are much more important than the third.
This implies that we do not need to design the equation of the feature transformation function by hand. Instead, we can just train a multilayer feedforward network to do both feature transformation and classification. Nonetheless, as we will see shortly, computing the gradient of the loss function of a feedforward neural network using the plain multivariate chain rule is not tractable. Fortunately, the gradient of the loss function can be computed efficiently using a method called backpropagation.

2.6.1 Backpropagation
Assume a feedforward network with a two-dimensional input layer and two hidden layers. The first hidden layer consists of four neurons and the second hidden layer consists of three neurons. Also, the output layer has three neurons. According to the number of neurons in the output layer, the network is a 3-class classifier. Like multiclass logistic regression, the loss of the network is computed using a softmax function.
Also, the activation functions of the hidden layers could be any nonlinear function. But, the activation function of the output layer is the identity function G^3_i(x) = x. The reason is that the output layer calculates the classification scores, which are obtained by computing only wG^2. The classification scores must be passed to the softmax function without any modification in order to compute the multiclass logistic loss. For this reason, in practice, the activation function of the output layer is the identity function. This means that we can ignore the activation function in the output layer. Similar to any compositional computation, a feedforward network can be illustrated using a computational graph. The computational graph analogous to this network is illustrated in Fig. 2.36.

Fig. 2.36 Computational graph corresponding to a feedforward network for classification of three classes. The network accepts two-dimensional inputs and it has two hidden layers. The hidden layers consist of four and three neurons, respectively. Each neuron has two inputs, namely the weights and the inputs from the previous layer. The derivative of each node with respect to each input is shown on the edges


Each computational node related to the function of the soma (the computation before applying the activation function) accepts two inputs: the weights and the output of the previous layer. The gradient of each node with respect to its inputs is indicated on the edges. Also note that w^a_b is a vector whose length is equal to the number of outputs from layer a − 1. Computing δL/δw^3_i is straightforward and it is explained in Fig. 2.24.
Assume we want to compute δL/δw^1_0. According to the multivariate chain rule, this is equal to adding up all paths starting from w^1_0 and ending at L, where the gradients along each path are multiplied. Based on this definition, δL/δw^1_0 will be equal to:

\frac{\delta L}{\delta w^1_0} = \sum_{j=0}^{2}\sum_{k=0}^{2}
\frac{\delta H^1_0}{\delta w^1_0}\,
\frac{\delta G^1_0}{\delta H^1_0}\,
\frac{\delta H^2_j}{\delta G^1_0}\,
\frac{\delta G^2_j}{\delta H^2_j}\,
\frac{\delta Z_k}{\delta G^2_j}\,
\frac{\delta G^3_k}{\delta Z_k}\,
\frac{\delta L}{\delta G^3_k}
\qquad (2.82)

which expands into nine product terms, one for every path from w^1_0 to L (three choices of the neuron j in the second hidden layer times three choices of the output neuron k).
Note that this is only for computing the gradient of the loss function with respect to the weights of one neuron in the first hidden layer. We need to repeat a similar procedure for computing the gradient of the loss with respect to every node in this graph. Although this computation is feasible for small feedforward networks, we usually need feedforward networks with more layers and with thousands of neurons in each layer to classify objects in images. In this case, the plain multivariate chain rule is not feasible to use, since a single update of the parameters would take a long time due to the excessive number of multiplications.


It is possible to make the computation of gradients more efficient. To this end, we
can factorize the above equation as follows:
\frac{\delta L}{\delta w^1_0} =
\frac{\delta H^1_0}{\delta w^1_0}\frac{\delta G^1_0}{\delta H^1_0}
\Bigg[
\frac{\delta H^2_0}{\delta G^1_0}\frac{\delta G^2_0}{\delta H^2_0}
\bigg(
\frac{\delta Z_0}{\delta G^2_0}\frac{\delta G^3_0}{\delta Z_0}\frac{\delta L}{\delta G^3_0}
+ \frac{\delta Z_1}{\delta G^2_0}\frac{\delta G^3_1}{\delta Z_1}\frac{\delta L}{\delta G^3_1}
+ \frac{\delta Z_2}{\delta G^2_0}\frac{\delta G^3_2}{\delta Z_2}\frac{\delta L}{\delta G^3_2}
\bigg)
+ \frac{\delta H^2_1}{\delta G^1_0}\frac{\delta G^2_1}{\delta H^2_1}
\bigg(
\frac{\delta Z_0}{\delta G^2_1}\frac{\delta G^3_0}{\delta Z_0}\frac{\delta L}{\delta G^3_0}
+ \frac{\delta Z_1}{\delta G^2_1}\frac{\delta G^3_1}{\delta Z_1}\frac{\delta L}{\delta G^3_1}
+ \frac{\delta Z_2}{\delta G^2_1}\frac{\delta G^3_2}{\delta Z_2}\frac{\delta L}{\delta G^3_2}
\bigg)
+ \frac{\delta H^2_2}{\delta G^1_0}\frac{\delta G^2_2}{\delta H^2_2}
\bigg(
\frac{\delta Z_0}{\delta G^2_2}\frac{\delta G^3_0}{\delta Z_0}\frac{\delta L}{\delta G^3_0}
+ \frac{\delta Z_1}{\delta G^2_2}\frac{\delta G^3_1}{\delta Z_1}\frac{\delta L}{\delta G^3_1}
+ \frac{\delta Z_2}{\delta G^2_2}\frac{\delta G^3_2}{\delta Z_2}\frac{\delta L}{\delta G^3_2}
\bigg)
\Bigg]
\qquad (2.83)

Compared with (2.82), the above equation requires far fewer multiplications, which makes it more efficient in practice. The computation starts with the innermost parentheses and moves to the outermost terms. The above factorization has a very nice property: if we study it carefully, it is as if the direction of the edges were reversed and, instead of moving from w^1_0 to L, the gradient computation moves in the reverse direction. Figure 2.37 shows the nodes analogous to each inner computation in the above equation.

Fig. 2.37 Reverse-mode differentiation starts from the end node and moves toward the starting nodes. At each node, it sums over the output edges of the node, where the value of each edge is computed by multiplying the edge with the derivative of the child node. Each rectangle with a different color and line style shows which part of the partial derivative has been computed up to that point


More precisely, consider the blue rectangles with dashed lines. These rectangles denote
\frac{\delta G^3_0}{\delta Z_0}\frac{\delta L}{\delta G^3_0},
which corresponds to the node Z_0 on the graph. These rectangles are in fact equal to \frac{\delta L}{\delta Z_0}. Likewise, the blue rectangles with dotted lines and dashed-dotted lines denote \frac{\delta L}{\delta Z_1} = \frac{\delta G^3_1}{\delta Z_1}\frac{\delta L}{\delta G^3_1} and \frac{\delta L}{\delta Z_2} = \frac{\delta G^3_2}{\delta Z_2}\frac{\delta L}{\delta G^3_2}, respectively.
The rectangles with solid red lines denote
\frac{\delta Z_0}{\delta G^2_0}\frac{\delta G^3_0}{\delta Z_0}\frac{\delta L}{\delta G^3_0}
+ \frac{\delta Z_1}{\delta G^2_0}\frac{\delta G^3_1}{\delta Z_1}\frac{\delta L}{\delta G^3_1}
+ \frac{\delta Z_2}{\delta G^2_0}\frac{\delta G^3_2}{\delta Z_2}\frac{\delta L}{\delta G^3_2},
which is analogous to the derivative of the loss function with respect to H^2_0. In other words, before computing this rectangle, we have in fact computed \frac{\delta L}{\delta H^2_0}. Similarly, the dotted and dashed red rectangles illustrate \frac{\delta L}{\delta H^2_1} and \frac{\delta L}{\delta H^2_2}, respectively. The same argument holds true for the green and purple rectangles.
Assume we want to compute δL/δw^1_1 afterwards. In that case, we do not need to recompute any of the terms inside the red and blue rectangles since they have already been computed once for δL/δw^1_0. This saves a great amount of computation, especially when the network has many layers and neurons.
The backpropagation algorithm has been developed based on this factorization. It is a method for efficiently computing the gradient of the leaf node with respect to every node on the graph using only one backward pass from the leaf node to the input nodes. This algorithm can be applied on any computational graph. Formally, let G = <V, E> denote a directed acyclic graph where V = {v_1, ..., v_K} is the set of nodes in the computational graph and E = {(v_i, v_j) | v_i, v_j ∈ V} is the set of ordered pairs (v_i, v_j) showing a directed edge from v_i to v_j. The number of edges going into a node is called its indegree and the number of edges coming out of a node is called its outdegree.
Formally, if in(v_a) = {(v_i, v_j) | (v_i, v_j) ∈ E ∧ v_j = v_a} returns the set of input edges of v_a, the indegree of v_a will be equal to |in(v_a)|, where |.| returns the cardinality of a set. Likewise, out(v_a) = {(v_i, v_j) | (v_i, v_j) ∈ E ∧ v_i = v_a} is the set of output edges of v_a and |out(v_a)| is equal to the outdegree of v_a. The computational node v_a is called an input if |in(v_a)| = 0 and |out(v_a)| > 0. Also, the computational node v_a is called a leaf if |out(v_a)| = 0 and |in(v_a)| > 0. Note that there must be only one leaf node in a computational graph, which is typically the loss. This is due to the fact that we are always interested in computing the derivative of one node with respect to all other nodes in the graph. If there were more than one leaf node in the graph, the gradient of the leaf node of interest with respect to the other leaf nodes would be equal to zero.
Suppose that the leaf node of the graph is denoted by v_leaf. In addition, let child(v_a) = {v_j | (v_i, v_j) ∈ E ∧ v_i = v_a} and parent(v_a) = {v_i | (v_i, v_j) ∈ E ∧ v_j = v_a} return the child nodes and parent nodes of v_a. Finally, the depth of v_a is equal to the number of edges on the longest path from the input nodes to v_a. We denote the depth of v_a by dep(v_a). It is noteworthy that for any node v_i in the graph with dep(v_i) ≥ dep(v_leaf), the gradient of v_leaf with respect to v_i will be equal to zero. Based on the above discussion, the backpropagation algorithm is defined as follows:


Algorithm 1 The backpropagation algorithm

G: <V, E> is a directed graph
V is the set of vertices
E is the set of edges
v_leaf is the leaf node in V
d_leaf ← dep(v_leaf)
v_leaf.d ← 1
for d = d_leaf − 1 to 0 do
    for v_a ∈ {v_i | v_i ∈ V ∧ dep(v_i) == d} do
        v_a.d ← 0
        for v_c ∈ child(v_a) do
            v_a.d ← v_a.d + (δv_c / δv_a) × v_c.d

The above algorithm can be applied on any computational graph. Generally, it computes the gradient of a loss function (the leaf node) with respect to all other nodes in the graph using only one backward pass from the loss node to the input nodes. In the above algorithm, each node is a data structure which stores information related to the computational unit, including its derivative. Specifically, the derivative of v_a is stored in v_a.d. We execute the above algorithm on the computational graph shown in Fig. 2.38.
Based on the above discussion, loss is the leaf node. Also, the longest path from the input nodes to the leaf node is equal to d_leaf = dep(loss) = 4. According to the algorithm, v_leaf.d must be set to 1 before executing the loop. In the figure, v_leaf.d is illustrated using d8. Then, the loop starts with d = d_leaf − 1 = 3. The first inner loop iterates over all nodes whose depth is equal to 3. This is equivalent to Z0 and Z1 on this graph. Therefore, v_a is set to Z0 in the first iteration. The innermost loop iterates over the children of v_a. This is analogous to child(Z0) = {loss}, which has only one element. Then, the derivative of v_a (i.e., Z0) is set to d6 = v_a.d = 0 + r × 1.

Fig. 2.38 A sample computational graph with a loss function. To reduce the clutter, the activation functions have been fused with the soma function of each neuron. Also, the derivatives on the edges are illustrated using small letters. For example, g denotes δH^2_0/δH^1_1


Table 2.2 Trace of the backpropagation algorithm applied on Fig. 2.38

Depth   Node    Derivative
3       Z0      d6 = r × 1
3       Z1      d7 = s × 1
2       H20     d4 = l × d6 + o × d7
2       H21     d5 = n × d6 + q × d7
1       H10     d1 = e × d4 + i × d5
1       H11     d2 = g × d4 + h × d6
1       H12     d3 = k × d7
0       w3:0    d14 = m × d6
0       w3:1    d15 = p × d7
0       w2:0    d12 = f × d4
0       w2:1    d13 = j × d5
0       w1:0    d9 = a × d1
0       w1:1    d10 = b × d2
0       w1:2    d11 = c × d3
0       x0      d16 = t × d1 + w × d2 + x × d3
0       x1      d17 = y × d1 + z × d2 + zz × d3

After that, the inner loop goes to Z1 and the innermost loop sets the derivative of Z1 to d7 = v_a.d = 0 + s × 1.
At this point the inner loop finishes and the next iteration of the main loop starts by setting d to 2. Then, the inner loop iterates over H20 and H21. In the first iteration of the inner loop, H20 is selected and its derivative d4 is set to 0. Next, the innermost loop iterates over the children of H20, which are Z0 and Z1. In the first iteration of the innermost loop, d4 is set to d4 = 0 + l × d6 and in the second iteration it is set to d4 = l × d6 + o × d7. At this point, the innermost loop terminates and the algorithm proceeds with H21. After finishing the innermost loop, d5 will be equal to d5 = n × d6 + q × d7. Likewise, the derivatives of the other nodes are updated. Table 2.2 shows how the derivatives of nodes at different depths are calculated using the backpropagation algorithm.
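A minimal Python sketch of Algorithm 1 is given below. It is our own illustrative construction, not code from the book; the Node class, the constant edge derivatives, and the tiny x → h → loss graph are assumptions made only for the example.

class Node:
    def __init__(self, name):
        self.name = name
        self.parents = []    # list of (parent_node, local_derivative) pairs
        self.children = []   # list of (child_node, d_child_d_self) pairs
        self.d = 0.0         # derivative of the leaf with respect to this node

def connect(parent, child, local_derivative):
    # Add a directed edge parent -> child whose derivative d(child)/d(parent)
    # is a given constant (in a real graph it would come from the forward pass).
    parent.children.append((child, local_derivative))
    child.parents.append((parent, local_derivative))

def dep(node, cache):
    # Depth = number of edges on the longest path from an input node to `node`.
    if node not in cache:
        cache[node] = 0 if not node.parents else 1 + max(dep(p, cache) for p, _ in node.parents)
    return cache[node]

def backpropagate(nodes, leaf):
    cache = {}
    d_leaf = dep(leaf, cache)
    leaf.d = 1.0
    for d in range(d_leaf - 1, -1, -1):
        for va in (n for n in nodes if dep(n, cache) == d):
            va.d = 0.0
            for vc, dvc_dva in va.children:
                va.d += dvc_dva * vc.d      # accumulate over the outgoing edges

# Toy usage: x -> h -> loss with edge derivatives 2 and 3, so dL/dx = 6.
x, h, loss = Node("x"), Node("h"), Node("loss")
connect(x, h, 2.0)
connect(h, loss, 3.0)
backpropagate([x, h, loss], loss)
print(x.d, h.d, loss.d)                     # 6.0 3.0 1.0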
We encourage the reader to carefully study the backpropagation algorithm since it is a very efficient way of computing gradients in complex computational graphs. Since we are able to compute the gradient of the loss function with respect to every parameter in a feedforward neural network, we can train a feedforward network using the gradient descent method (Appendix A).
Given an input x, the data is forwarded through the network until it reaches the leaf node. Then, the backpropagation algorithm is executed and the gradient of the loss with respect to every node, given the input x, is computed. Using this gradient, the parameter vectors are updated.
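The following compact sketch (our own, not the book's code) performs one such training step for a tiny one-hidden-layer network with sigmoid activations and a squared loss; the layer sizes, data, and learning rate are arbitrary assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))             # a single 3-dimensional input
y = np.array([[1.0]])                   # its target
W1, b1 = rng.normal(scale=0.01, size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(scale=0.01, size=(4, 1)), np.zeros((1, 1))

# Forward pass until the leaf (loss) node.
h = sigmoid(x @ W1 + b1)                # hidden activations
z = sigmoid(h @ W2 + b2)                # network output
loss = 0.5 * np.sum((z - y) ** 2)

# Backward pass (backpropagation) and a gradient-descent update.
dz = (z - y) * z * (1 - z)              # dL/d(pre-activation of the output)
dW2, db2 = h.T @ dz, dz.sum(axis=0, keepdims=True)
dh = dz @ W2.T * h * (1 - h)            # dL/d(pre-activation of the hidden layer)
dW1, db1 = x.T @ dh, dh.sum(axis=0, keepdims=True)

lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2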


2.6.2 Activation Functions
There are different kinds of activation functions that can be used in neural networks. However, we are mainly interested in activation functions that are nonlinear and continuously differentiable. A nonlinear activation function makes it possible for a neural network to learn any nonlinear function, provided that the network has enough neurons and layers. In fact, a feedforward network with linear activations in all neurons is just a linear function. Consequently, it is important to have at least one neuron with a nonlinear activation function in order to make a neural network nonlinear.
The differentiability property is also important since we mainly train a neural network using the gradient descent method. Although non-gradient-based optimization methods such as genetic algorithms and particle swarm optimization are used for optimizing simple functions, gradient-based methods are the most commonly used methods for training neural networks. That said, using non-gradient-based methods for training a neural network is an active research area.
Beside the above factors, it is also desirable that the activation function approximates the identity mapping near the origin. To explain this, we should consider the activation of a neuron. Formally, the activation of a neuron is given by G(wx^T + b), where G is the activation function. Usually, the weight vector w and the bias b are initialized with values close to zero. Consequently, wx^T + b will be close to zero. If G approximates the identity function near zero, its output will be approximately equal to its input, that is, G(wx^T + b) ≈ wx^T + b whenever wx^T + b ≈ 0. In terms of gradient descent, this yields a strong gradient, which helps the training algorithm converge faster.

2.6.2.1 Sigmoid
The sigmoid activation function and its derivative are given by the following equations. Figure 2.39 shows their plots.

G_{sigmoid}(x) = \frac{1}{1 + e^{-x}}        (2.84)

and

G'_{sigmoid}(x) = G_{sigmoid}(x)(1 - G_{sigmoid}(x)).        (2.85)

Fig. 2.39 Sigmoid activation function and its derivative

The sigmoid activation G_sigmoid(x): R → [0, 1] is smooth and differentiable everywhere. In addition, it is a biologically inspired activation function. In the past, sigmoid was a very popular activation function in feedforward neural networks. However, it has two problems. First, it does not approximate the identity function near zero. This is due to the fact that G_sigmoid(0) is not close to zero and G'_sigmoid(0) is not close to 1. More importantly, sigmoid is a squashing function, meaning that it saturates as |x| increases. In other words, its gradient becomes very small if x is not close to the origin.
This causes a serious problem in backpropagation which is known as the vanishing gradient problem. The backpropagation algorithm multiplies the gradient of the activation function with the derivatives of its children in order to compute the gradient of the loss function with respect to the current node. If x is far from the origin, G'_sigmoid will be very small. When it is multiplied by the derivatives of its children, the gradient of the loss with respect to that node becomes even smaller. If there are many layers with sigmoid activations, the gradient becomes approximately zero (i.e., the gradient vanishes) in the first layers. For this reason, the weight changes will be very small or even negligible. This causes the network to get stuck in the current configuration of parameters and stop learning. For these reasons, the sigmoid activation function is not used in deep architectures, since training the network becomes nearly impossible.
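The following small numerical illustration (our own, not from the book) shows the saturation effect: the sigmoid derivative is at most 0.25, so multiplying many such derivatives together quickly drives the gradient toward zero. The value x = 4 and the depth of 10 layers are arbitrary assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = 4.0                       # a pre-activation value away from the origin
grad = 1.0
for layer in range(10):       # pretend the same value feeds 10 stacked sigmoids
    grad *= sigmoid_grad(x)
print(grad)                   # on the order of 1e-18: the gradient has vanished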

2.6.2.2 Hyperbolic Tangent
The hyperbolic tangent activation function is in fact a rescaled version of the sigmoid function. It is defined by the following equations. Figure 2.40 illustrates the plot of the function and its derivative.

G_{tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{2}{1 + e^{-2x}} - 1        (2.86)

G'_{tanh}(x) = 1 - G_{tanh}(x)^2        (2.87)

The hyperbolic tangent function G_tanh(x): R → [−1, 1] is a smooth function which is differentiable everywhere. Its range is [−1, 1], as opposed to the range of the sigmoid function, which is [0, 1]. More importantly, the hyperbolic tangent function approximates the identity function close to the origin. This is easily observable from the plots, where G_tanh(0) ≈ 0 and G'_tanh(0) ≈ 1. This is a desirable property which increases the convergence speed of the gradient descent algorithm. However, similar to the sigmoid activation function, it saturates as |x| increases. Therefore, it may suffer from the vanishing gradient problem in feedforward neural networks with many layers. Nonetheless, the hyperbolic tangent activation function is preferred over the sigmoid function because it approximates the identity function near the origin.


Fig. 2.40 Tangent hyperbolic activation function and its derivative

2.6.2.3 Softsign
The softsign activation function is closely related to the hyperbolic tangent function. However, it has more desirable properties. Formally, the softsign activation function and its derivative are defined as follows:

G_{softsign}(x) = \frac{x}{1 + |x|}        (2.88)

G'_{softsign}(x) = \frac{1}{(1 + |x|)^2}        (2.89)

Similar to the hyperbolic tangent function, the range of the softsign function is [−1, 1]. Also, the function is equal to zero at the origin and its derivative at the origin is equal to 1. Therefore, it approximates the identity function at the origin. Comparing the function and its derivative with the hyperbolic tangent, we observe that it also saturates as |x| increases. However, the saturation rate of the softsign function is lower than that of the hyperbolic tangent function, which is a desirable property. In addition, the gradient of the softsign function near the origin drops at a greater rate compared with the hyperbolic tangent. In terms of computational complexity, softsign requires less computation than the hyperbolic tangent function. The softsign activation function can be used as an alternative to the hyperbolic tangent activation function (Fig. 2.41).
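A short sketch (ours, not the book's) makes the difference in saturation concrete by comparing the two derivatives at a few points; the sample points are arbitrary.

import numpy as np

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def softsign_grad(x):
    return 1.0 / (1.0 + np.abs(x)) ** 2

for x in [0.0, 1.0, 3.0, 6.0]:
    print(x, tanh_grad(x), softsign_grad(x))
# At x = 6 the tanh derivative is about 2.5e-5 while the softsign derivative is
# about 0.02, so softsign saturates far more slowly in the tails.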

2.6.2.4 Rectified Linear Unit
Using the sigmoid, hyperbolic tangent, and softsign activation functions is mainly limited to neural networks with a few layers. When a feedforward network has a few hidden layers it is called a shallow neural network. In contrast, a network with many hidden layers is called a deep neural network. The main reason is that in deep neural networks the gradient of these three activation functions vanishes during backpropagation, which causes the network to stop learning.


Fig. 2.41 The softsign activation function and its derivative

Fig. 2.42 The rectified linear unit activation function and its derivative

A rectified linear unit (ReLU) is an activation function which is computationally very efficient and is defined as follows:

G_{relu}(x) = \max(0, x)        (2.90)

G'_{relu}(x) = \begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases}        (2.91)

ReLU is a very simple nonlinear activation function which works very well in practice. Its derivative in R+ is always 1 and it does not saturate in R+. The range of this function is [0, ∞). This function does not approximate the identity function near the origin, but because it does not saturate in R+ it always produces a strong gradient in this region. Consequently, it does not suffer from the vanishing gradient problem. For this reason, it is a good choice for deep neural networks (Fig. 2.42).
One property of the ReLU activation is that it may produce dead neurons during training. A dead neuron always returns 0 for every sample in the dataset. This may happen because the weights of a dead neuron have been adjusted such that wx^T for the neuron is always negative. As a result, when it is passed to the ReLU activation function, it always returns zero. The advantage of this property is that the output of a layer may have entries which are always zero. These outputs can be removed from the network to make it computationally more efficient. The downside is that dead neurons may affect the overall accuracy of the network. So, it is always a good practice to check the network during training for dead neurons.
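A small sketch (our own construction) of ReLU together with a simple dead-neuron check follows: a hidden unit whose output is zero for every sample in the batch is flagged as dead. The random data and the artificially large negative bias for unit 7 are assumptions made only to trigger the condition.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 20))                 # a batch standing in for the dataset
W = rng.normal(scale=0.1, size=(20, 50))
b = np.zeros(50)
b[7] = -100.0                                   # force one unit to be dead for this data

H = relu(X @ W + b)                             # hidden-layer outputs
dead = np.where(np.all(H == 0.0, axis=0))[0]    # units that never activate
print("dead neurons:", dead)                    # expected to report unit 7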

2.6.2.5 Leaky Rectified Linear Unit
The basic idea behind the leaky ReLU (Maas et al. 2013) is to solve the problem of dead neurons, which is inherent in the ReLU function. The leaky ReLU is defined as follows:

G_{lrelu}(x) = \begin{cases} \alpha x & x < 0 \\ x & x \ge 0 \end{cases}        (2.92)

G'_{lrelu}(x) = \begin{cases} \alpha & x < 0 \\ 1 & x \ge 0 \end{cases}        (2.93)

One interesting property of the leaky ReLU is that its gradient does not vanish in the negative region, as opposed to the ReLU function. Rather, it returns the constant value α. The hyperparameter α usually takes a value in [0, 1]. A common choice is to set α to 0.01, but on some datasets higher values work better, as proposed in Xu et al. (2015). In practice, leaky ReLU and ReLU may produce similar results. This might be due to the fact that the positive regions of these functions are identical (Fig. 2.43).

2.6.2.6 Parameterized Rectified Linear Unit
The parameterized rectified linear unit (PReLU) is in fact the leaky ReLU (He et al. 2015). The difference is that α is treated as a parameter of the neural network, so it can be learned from data. The only thing that needs to be done is to compute the gradient of the leaky ReLU function with respect to α, which is given by:

\frac{\delta G_{prelu}(x)}{\delta \alpha} = \begin{cases} x & x < 0 \\ 0 & x \ge 0 \end{cases}        (2.94)

Then, the gradient of the loss function with respect to α is obtained using the backpropagation algorithm and α is updated similarly to the other parameters of the neural network.

Fig. 2.43 The leaky rectified linear unit activation function and its derivative

2.6.2.7 Randomized Leaky Rectified Linear Unit
The main idea behind the randomized leaky rectified linear unit (RReLU) is to add randomness to the activations during training of a neural network. To achieve this goal, the RReLU activation draws the value of α from the uniform distribution U(a, b), where a, b ∈ [0, 1), during training of the network. Drawing the value of α can be done once for the whole network or it can be done for each layer separately. To increase the randomness, one may draw a different α from the uniform distribution for each neuron in the network. Figure 2.44 illustrates how the function and its derivative vary using this method.
At test time, the parameter α is set to the constant value ᾱ. This value is obtained by computing the mean of the α values assigned to each neuron during training. Since the value of α is drawn from U(a, b), the value of ᾱ can be easily obtained by computing the expected value of U(a, b), which is equal to ᾱ = (a + b)/2.

Fig. 2.44 The randomized leaky rectified linear unit activation function and its derivative
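A brief sketch (our own) of this idea is shown below: during training α is drawn from U(a, b) for every activation, while at test time the fixed value (a + b)/2 is used. The interval [0.1, 0.3] and the sample input are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(2)
a, b = 0.1, 0.3

def rrelu(x, train=True):
    if train:
        alpha = rng.uniform(a, b, size=x.shape)   # one alpha per activation
    else:
        alpha = (a + b) / 2.0                     # expected value of U(a, b)
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 1.5])
print(rrelu(x, train=True))    # negative entries scaled by random alphas
print(rrelu(x, train=False))   # negative entries scaled by 0.2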
2.6.2.8 Exponential Linear Unit
Exponential linear units (ELU) (Clevert et al. 2015) can be seen as a smoothed version of the shifted ReLU activation function. By shifted ReLU we mean changing the original ReLU from max(0, x) to max(−1, x). Using this shift, the activation passes a negative number near the origin. The exponential linear unit approximates the shifted ReLU using a smooth function which is given by:

G_{elu}(x) = \begin{cases} \alpha(e^{x} - 1) & x < 0 \\ x & x \ge 0 \end{cases}        (2.95)

G'_{elu}(x) = \begin{cases} G_{elu}(x) + \alpha & x < 0 \\ 1 & x \ge 0 \end{cases}        (2.96)

The ELU activation usually speeds up learning. Also, as illustrated in the plot, its derivative does not drop immediately in the negative region. Instead, the gradient in the negative region saturates nonlinearly (Fig. 2.45).

Fig. 2.45 The exponential linear unit activation function and its derivative

2.6.2.9 Softplus
The last activation function that we explain in this book is called softplus. Broadly speaking, we can think of the softplus activation function as a smooth version of the ReLU function. In contrast to the ReLU, which is not differentiable at the origin, the softplus function is differentiable everywhere. In addition, similar to the ReLU activation, its range is [0, ∞). The function and its derivative are defined as follows:

G_{softplus}(x) = \ln(1 + e^{x})        (2.97)

G'_{softplus}(x) = \frac{1}{1 + e^{-x}}        (2.98)

The derivative of the softplus function is the sigmoid function, which means the range of the derivative is [0, 1]. The difference with ReLU is that the derivative of softplus is also a smooth function which saturates as |x| increases (Fig. 2.46).
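A compact sketch (our own) of the ELU and softplus activations from (2.95)-(2.98) follows; alpha is the ELU hyperparameter and the sample grid is arbitrary.

import numpy as np

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    return np.where(x >= 0, 1.0, elu(x, alpha) + alpha)   # equals alpha * exp(x) for x < 0

def softplus(x):
    return np.log1p(np.exp(x))       # ln(1 + e^x)

def softplus_grad(x):
    return 1.0 / (1.0 + np.exp(-x))  # the sigmoid function

x = np.linspace(-3, 3, 7)
print(elu(x), elu_grad(x))
print(softplus(x), softplus_grad(x))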


Fig. 2.46 The softplus activation function and its derivative

Fig. 2.47 The weights affect the magnitude of the function for a fixed value of bias and x (left).
The bias term shifts the function to left or right for a fixed value of w and x (right)

2.6.3 Role of Bias
Basically, the input to an activation function is wx^T + b. The first term in this expression computes the dot product between w and x. Assume that x is one-dimensional (a scalar). To see the effect of w, we can set b = 0 and keep the value of x fixed. Then, the effect of w can be illustrated by plotting the activation function for different values of w. This is shown in the left plot of Fig. 2.47.
We observe that changing the weights affects the magnitude of the activation function. For example, assume a neural network without a hidden layer where the output layer has only one neuron with a sigmoid activation function. The outputs of the neural network for the inputs x1 = 6 and x2 = −6 are equal to σ(6w + b) = 0.997 and σ(−6w + b) = 0.002 when w = 1 and b = 0. Suppose we want to find w, keeping b = 0, such that σ(6w + b) = 0.999 and σ(−6w + b) = 0.269. There is no w which perfectly satisfies these two conditions. It is only possible to find a w that approximates the above values as accurately as possible. To this end, we need to minimize the squared error loss of the neuron. If we do this, the approximation error will be high, indicating that it is not possible to approximate these values accurately.
However, by allowing a bias term these two conditions can be satisfied: there exist w and b such that σ(6w + b) = 0.999 and σ(−6w + b) = 0.269. To see the effect of b, we can keep w and x fixed and change the value of b. The right plot in Fig. 2.47 shows the result. It is clear that the bias term shifts the activation function to the left or right. It gives a neuron more freedom to fit the data.
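The following small numerical sketch (ours, not from the book) solves for such a w and b. Requiring σ(6w + b) = 0.999 and σ(−6w + b) = 0.269 gives two linear equations in w and b once the sigmoid is inverted with the logit function.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))     # inverse of the sigmoid

# 6w + b = logit(0.999) and -6w + b = logit(0.269)
A = np.array([[6.0, 1.0], [-6.0, 1.0]])
t = np.array([logit(0.999), logit(0.269)])
w, b = np.linalg.solve(A, t)
print(w, b)                                     # roughly w ~ 0.66 and b ~ 2.95
print(sigmoid(6 * w + b), sigmoid(-6 * w + b))  # ~0.999 and ~0.269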
According to the above discussion, using a bias term in a neuron seems necessary. However, the bias term might be omitted in very deep neural networks. Assume the final goal of a neural network is to estimate the pairs (x = 6, f(x) = 0.999) and (x = −6, f(x) = 0.269). If we are forced to use a single-layer neural network with only one neuron, the estimation error will be high without a bias term. But, if we are allowed to use more layers and neurons, then it is possible to design a neural network that accurately approximates these pairs of data.
In deep neural networks, even if the bias term is omitted, the network might be able to shift the input across different layers if doing so reduces the loss. Still, it is common practice to keep the bias term and train it using data. Omitting the bias term may only increase the computational efficiency of a neural network. If computational resources are not limited, it is not necessary to remove this term from neurons.

2.6.4 Initialization
The gradient descent algorithm starts by setting an initial value for the parameters. A feedforward neural network has mainly two kinds of parameters: weights and biases. All biases are usually initialized to zero. There are different algorithms for initializing the weights. The common approach is to initialize them using a uniform or a normal distribution. We will explain initialization methods in the next chapter. The most important thing to keep in mind is that the weights of the neurons must be different. If they all have the same value, neurons in the same layer will have identical gradients, leading to the same update rule. For this reason, weights must be initialized with different values. Also, they are commonly initialized very close to zero.
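A minimal sketch (ours) of the initialization described above follows: biases start at zero and weights are drawn from a small-scale normal distribution so that no two neurons start with identical weights; the layer sizes and scale are arbitrary.

import numpy as np

def init_layer(n_in, n_out, scale=0.01, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(loc=0.0, scale=scale, size=(n_in, n_out))  # small random weights
    b = np.zeros(n_out)                                       # zero biases
    return W, b

W1, b1 = init_layer(1000, 5000)
print(W1.shape, b1.shape, W1.std())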

2.6.5 How to Apply on Images
Assume the dataset X = {(x1, y1), ..., (xn, yn)}, where the input vector xi ∈ R^1000 is a 1000-dimensional vector and yi ∈ {0, ..., c} is an integer indicating the class of the vector. A rule of thumb in designing a neural network for classification of these vectors is to have more neurons in the first hidden layer and to decrease the number of neurons in the subsequent layers. For instance, we can design a neural network with three hidden layers where the first hidden layer has 5000 neurons, the second hidden layer has 2000 neurons, and the third hidden layer has 500 neurons. Finally, the output layer will contain c neurons.


One important step in designing a neural network is to count the total number of parameters in the network. For example, there are 5000 × 1000 = 5,000,000 weights between the input layer and the first hidden layer. Also, the first hidden layer has 5000 biases. Similarly, the number of weights between the first hidden layer and the second hidden layer is equal to 5000 × 2000 = 10,000,000, plus 2000 biases. The number of weights between the second hidden layer and the third hidden layer is equal to 2000 × 500 = 1,000,000, plus 500 biases. Finally, the number of weights and biases between the third hidden layer and the output layer is equal to 500 × c + c. Overall, this neural network is formulated using 16,007,500 + 500c + c parameters. Even for this shallow neural network, the number of parameters is very high. Training this neural network requires a dataset with many training samples, and collecting such a dataset might not be practical.
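A short helper (our own sketch, not the book's code) that counts the weights and biases of a fully connected network reproduces this figure; the value of c below is an arbitrary example.

def count_parameters(layer_sizes):
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weights plus one bias per neuron
    return total

c = 10   # an example number of classes
print(count_parameters([1000, 5000, 2000, 500, c]))   # 16,007,500 + 500c + c = 16,012,510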
Now, suppose our aim is to classify traffic signs. The input of the classifier might be 50 × 50 × 3 images. Our aim is to classify 100 classes of traffic signs. We mentioned before that training a classifier directly on pixel intensities does not produce accurate results. Better results were obtained by extracting features using the histogram of oriented gradients. We also mentioned that neural networks learn the feature transformation function automatically from data.
Consequently, we can design a neural network where the input of the network is the raw image and its output is the classification score of the image for each class of traffic sign. The neural network learns to extract features from the image so that they become linearly separable in the last hidden layer. A 50 × 50 × 3 image can be stored in a three-dimensional matrix. If we flatten this matrix, the result will be a 7500-dimensional vector.
Suppose a neural network containing three hidden layers with 10000-8000-3000 neurons in these layers. This network is parameterized using 179,321,100 parameters. A dramatically smaller neural network with three hidden layers such as 500-300-250 will still have 4,001,150 parameters. Although the number of parameters in the latter neural network is still high, it may not produce accurate results. In addition, the number of parameters in the former network is very high, which makes it impossible to train this network with the current algorithms, hardware, and datasets.
Besides, classification of objects is a complex problem. The reason is that some traffic signs differ only slightly. Also, their illumination changes during the day. There are also other factors that we will discuss in later chapters. For these reasons, accurately learning a feature transformation function that makes traffic signs linearly separable in the feature space requires a deeper architecture. As the depth of a neural network increases, the number of parameters may also increase. The reason that a deeper model is preferable over a shallower model is illustrated in Fig. 2.48.
The wide black line in this figure shows the function that must be approximated using a neural network. The red line illustrates the output of a neural network with four hidden layers in a 10-10-9-6 architecture using hyperbolic tangent activation functions. In addition, the white line shows the output of a neural network consisting of five hidden layers in an 8-6-4-3-2 architecture using the hyperbolic tangent activation function. Comparing the number of parameters in these two networks, the shallower network has 296 parameters and the deeper network has 124 parameters. In general, deeper models require fewer parameters for modeling a complex function. It is obvious from the figure that the deeper model approximates the function accurately despite the fact that it has far fewer parameters.

Fig. 2.48 A deeper network requires fewer neurons to approximate a function
The feedforward neural networks that we have explained in this section are called fully connected feedforward neural networks, because every neuron in one layer is connected to all neurons in the previous layer. As we explained above, modeling complex functions such as extracting features from an image may require deep neural networks. Training deep fully connected networks on datasets of images is not tractable due to the very high number of parameters. In the next chapter, we will explain a way to dramatically reduce the number of parameters in a neural network and train it on images.

2.7 Summary
In this chapter, we first explained what classification problems are and what a decision boundary is. Then, we showed how to model a decision boundary using linear models. In order to better understand the intuition behind a linear model, linear models were also studied from a geometrical perspective. A linear model needs to be trained on a training dataset. To this end, there must be a way to assess how good a linear model is at classifying the training samples. For this purpose, we thoroughly explained different loss functions including the 0/1 loss, squared loss, hinge loss, and logistic loss. Then, methods for extending binary models to multiclass models, including one-versus-one and one-versus-rest, were reviewed. It is possible to generalize a binary linear model directly into a multiclass model. This requires loss functions that can be applied on multiclass datasets. We showed how to extend the hinge loss and logistic loss to multiclass datasets.
The big issue with linear models is that they perform poorly on datasets in which the classes are not linearly separable. To overcome this problem, we introduced the idea of a feature transformation function and applied it on a toy example. Designing a feature transformation function by hand could be a tedious task, especially when it has to be applied on high-dimensional datasets. A better solution is to learn a feature transformation function directly from training data and to train a linear classifier on top of it.
We developed the idea of feature transformation from simple functions to compositional functions and explained how neural networks can be used for simultaneously learning a feature transformation function together with a linear classifier. Training a complex model such as a neural network requires computing the gradient of the loss function with respect to every parameter in the model. Computing gradients using the conventional chain rule might not be tractable. We explained how to factorize the multivariate chain rule and reduce the number of arithmetic operations. Using this formulation, we explained the backpropagation algorithm for computing gradients on any computational graph.
Next, we explained different activation functions that can be used in designing neural networks. We mentioned why ReLU activations are preferable over traditional activations such as the hyperbolic tangent. The role of bias in neural networks was also discussed in detail. Finally, we finished the chapter by mentioning how an image can be used as the input of a neural network.

2.8 Exercises
2.1 Find an equation to compute the distance of a point p from a line.

2.2 Given the convex set X ⊂ R^d, we know that the function f(x): X → R is convex if:

∀ x1, x2 ∈ X, α ∈ [0, 1]: f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2).        (2.99)

Using the above definition, show why the 0/1 loss function is nonconvex.

2.3 Prove that the squared loss is a convex function.

2.4 Why does setting a in the hinge loss to different values not affect the classification accuracy of the learned model?

2.5 Compute the partial derivatives of the squared hinge loss and modified Huber loss functions.

2.6 Apply log(A × B) = log(A) + log(B) on (2.39) to obtain (2.39).


2.7 Show that:

\frac{\delta \sigma(a)}{\delta a} = \sigma(a)(1 - \sigma(a)).        (2.100)

2.8 Find the partial derivative of (2.41) with respect to wi using the chain rule of derivatives.

2.9 Show how we obtained (2.46).

2.10 Compute the partial derivatives of (2.46) and use them in the gradient descent method for minimizing the loss represented by this equation.

2.11 Compute the partial derivatives of (2.56) and obtain (2.57).

2.12 Draw an arbitrary computational graph with three leaf nodes and call them A, B and C. Show that δC/δA = 0 and δC/δB = 0.

2.13 Show that a feedforward neural network with linear activation functions in all layers is in fact just a linear function.

2.14 Show that it is impossible to find a w such that:

\sigma(6w) = \frac{1}{1 + e^{-6w}} = 0.999
\sigma(-6w) = \frac{1}{1 + e^{6w}} = 0.269        (2.101)

References

Clevert DA, Unterthiner T, Hochreiter S (2015) Fast and accurate deep network learning by exponential linear units (ELUs), pp 1-13. arXiv:1511.07289
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. arXiv:1502.01852
Maas AL, Hannun AY, Ng AY (2013) Rectifier nonlinearities improve neural network acoustic models. In: ICML workshop on deep learning for audio, speech and language processing, vol 28. http://www.stanford.edu/~awni/papers/relu_hybrid_icml2013_final.pdf
Stallkamp J, Schlipsing M, Salmen J, Igel C (2012) Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Netw 32:323-332. doi:10.1016/j.neunet.2012.02.016
Xu B, Wang N, Chen T (2015) Empirical evaluation of rectified activations in convolutional network. arXiv:1505.00853v2

3 Convolutional Neural Networks

In the previous chapter, we explained how to train a linear classifier using loss functions. The main problem of linear classifiers is that the classification accuracy drops if the classes are not separable using a hyperplane. To overcome this problem, the data can be transformed to a new space where the classes are linearly separable. Clearly, the transformation must be nonlinear.
There are two common approaches to designing a transformation function. In the first approach, an expert designs a function manually. This method could be tedious, especially when the dimensionality of the input vector is high. Also, it may not produce accurate results and it may require much trial and error to create an accurate feature transformation function. In the second approach, the feature transformation function is learned from data.
Fully connected feedforward neural networks are commonly used for simultaneously learning features and classifying data. The main problem with using a fully connected feedforward neural network on images is that the number of neurons could be very high even for shallow architectures, which makes them impractical to apply to images. The basic idea behind convolutional neural networks (ConvNets) is to devise a solution for reducing the number of parameters, allowing a network to be deeper with far fewer parameters.
In this chapter, we will explain the principles of ConvNets and we will describe a few examples where ConvNets with different architectures have been used for classification of objects.

3.1 Deriving Convolution from a Fully Connected Layer
Recall from Sect. 2.6 that in a fully connected layer, all neurons are connected to every neuron in the previous layer. In the case of grayscale images, the input of the first hidden layer is a W × H matrix which is denoted by x ∈ [0, 1]^{W×H}. Here, we have indicated the intensity of pixels by a real number between 0 and 1, but the following argument holds true for intensities within any range. Assuming that there are K neurons in the first hidden layer, each neuron H^1_i, i = 0 ... K, in the hidden layer is connected to all pixels in the grayscale image, leading to W × H connections only for H^1_i.

Fig. 3.1 Every neuron in a fully connected layer is connected to every pixel in a grayscale image
Assume the 16 × 16 grayscale image illustrated in Fig. 3.1, which is connected to a hidden layer consisting of 7200 neurons. As indicated in this figure, the image can be thought of as a 16 × 16 = 256-dimensional vector. The first neuron in the hidden layer is connected to 256 elements of the input. Similarly, the other neurons are also connected to every element of the input image. Consequently, this fully connected layer is formulated using 256 × 7200 = 1,843,200 distinct parameters.
One way to reduce the number of parameters is to reduce the number of neurons in the hidden layer. However, this may adversely affect the classification performance. For this reason, we usually need to keep the number of neurons in the first hidden layer high. In order to reduce the number of parameters, we first hypothetically rearrange the 7200 neurons into 50 blocks of 12 × 12 neurons. This is illustrated in Fig. 3.2. Here, f_i, i = 0 ... 49, shows the index of the block. Each block is formed using 12 × 12 neurons.
The number of required parameters is still 256 × 50 × 12 × 12 = 1,843,200. We can dramatically reduce the number of parameters by considering the geometry of pixels in an image. Concretely, the pixel (m, n) in an image is much more highly correlated with its close neighbors than with its far neighbors. Assume that neuron (0, 0) in each block is intended to extract information from a region around pixel (2, 2) in the image. Likewise, neuron (11, 11) in all blocks is intended to extract information from pixel (14, 14) in the image.
Since the correlation between far pixels is very low, neuron (0, 0) needs only the information from pixel (2, 2) and its neighbors in order to extract information from this region. For example, in Fig. 3.3, we have connected each neuron in each block to a 5 × 5 region in the image. The neurons in a block cover the whole input image and extract information from every 5 × 5 patch of the input image.


Fig. 3.2 We can hypothetically arrange the neurons in blocks. Here, the neurons in the hidden layer
have been arranged into 50 blocks of size 12 × 12

Fig. 3.3 Neurons in each block can be connected locally to the input image. In this figure, each
neuron is connected to a 5 × 5 region in the image

In this way, the number of parameters is reduced to (5 × 5) × 50 × 12 × 12 = 180,000, which is a 90.2% reduction in the number of parameters compared with the fully connected approach. This number can be reduced even further. The weights in Fig. 3.3 have been illustrated using small letters. We observe that each neuron in a block has different weights compared with the other neurons in the same block. To further reduce the number of parameters, we can assume that all neurons in one block share the same weights. This is shown in Fig. 3.4. This means that each block is formulated using only 25 weights. Consequently, there are only 5 × 5 × 50 = 1250 weights between the image and the hidden layer, leading to a 99.93% reduction in the number of parameters compared with the fully connected layer.


Fig. 3.4 Neurons in one block can share the same set of weights leading to reduction in the number
of parameters

This great amount of reduction was achieved using a technique called weight sharing between neurons. Denoting the neuron (p, q) in block l of the above figure by f^l_{p,q}, the output of this neuron is given by

f^l_{p,q} = G\left(\sum_{i=0}^{4}\sum_{j=0}^{4} im(p + i, q + j)\, w^l_{i,j}\right)        (3.1)

where w^l_{a,b} is the weight (a, b) in block l and p, q = 0, ..., 11. In the above example, a and b vary between 0 and 4 since each neuron is connected to a 5 × 5 region. The region to which a neuron is connected is called the receptive field of the neuron. In this example, the receptive field of the neuron is a 5 × 5 region. The output of each block will have the same size as the block. Hence, in this example, the output of each block will be a 12 × 12 matrix. With this formulation, and denoting the output matrix of the lth block by f^l, this matrix can be obtained by computing

f^l(p, q) = G\left(\sum_{i=0}^{4}\sum_{j=0}^{4} im(p + i, q + j)\, w^l_{i,j}\right) \qquad \forall p, q \in [0, 11].        (3.2)

The above equation is exactly analogous to convolving the 5 × 5 filter w^l with the input image.1 As a result, the output of the lth block is obtained by convolving the filter w^l with the input image. The convolution operator is usually denoted by ∗ in the literature. Based on the above discussion, the layer in Fig. 3.4 can be represented using a filter and convolution operator as illustrated in Fig. 3.5.
1 Readers that are not familiar with convolution can refer to textbooks on image processing for detailed information.
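A small sketch (our own, not the book's code) shows that the shared-weight block of (3.2) is the same as sliding a 5 × 5 filter over the 16 × 16 image: every block output is a 12 × 12 map. The random image, random weights, and tanh activation are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
im = rng.random((16, 16))          # the grayscale input image
w = rng.random((5, 5))             # the shared weights of one block
G = np.tanh                        # any activation function

out = np.zeros((12, 12))
for p in range(12):
    for q in range(12):
        out[p, q] = G(np.sum(im[p:p + 5, q:q + 5] * w))   # Eq. (3.2)
print(out.shape)                   # (12, 12)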


Fig. 3.5 The above convolution layer is composed of 50 filters of size 5 × 5. The output of the layer is obtained by convolving each filter with the image

The output of a convolution layer is obtained by convolving each filter with the input image. The output of the convolution layer will be a series of images, where the number of images is equal to the number of filters. Then, the activation function is applied on each image separately, in an element-wise fashion. In general, if the size of the image is W × H and the convolution layer is composed of L filters of size M × N, the output of the convolution layer will be L images of size (W − M + 1) × (H − N + 1), where each image is obtained by convolving the corresponding filter with the input image.
A deep convolutional neural network normally contains more than one convolution layer. From an image processing perspective, a convolution filter is a two-dimensional array (i.e., a matrix) which is applied on a grayscale image. In the case of multichannel images such as RGB images, the convolution filter might still be a two-dimensional array which is separately applied on each channel.
However, the main idea behind convolution filters in ConvNets is that the result of convolving a filter with a multichannel input is always single channel. In other words, if the convolution filter f is applied on the RGB image X with three channels, X ∗ f must be a single-channel image. A multichannel image can be seen as a three-dimensional array where the first two dimensions show the spatial coordinates of pixels and the third dimension shows the channel. For example, an 800 × 600 RGB image is stored in a 600 × 800 × 3 array. In the same way, a 640 × 480 multispectral image which is taken in 7 different spectra is stored in a 480 × 640 × 7 array.
Assuming that X ∈ R^{H×W×C} is a multichannel image with C channels, our aim is to design the filter f such that X ∗ f ∈ R^{H'×W'×1}, where H' and W' depend on the height and width of the filter, respectively. To this end, f must be a three-dimensional filter where the third dimension is always equal to the number of input channels. Formally, if f ∈ R^{h×w×C} then X ∗ f ∈ R^{(H−h+1)×(W−w+1)×1}. Based on this definition, we can easily design multiple convolution layers. An example is illustrated in Fig. 3.6. In this example, the input of the network is a single-channel image (i.e., a grayscale image). The first convolution layer contains L1 filters of size M1 × N1 × 1. The third dimension of the filters is 1 because the input of this layer is a single-channel image.


Fig. 3.6 Normally, convolution filters in a ConvNet are three-dimensional arrays where the first two dimensions are arbitrary numbers and the third dimension is always equal to the number of output channels in the previous layer

Applying these filters on the input will produce L1 images of size (H − M1 + 1) × (W − N1 + 1). From another perspective, the output of the first convolution layer can be seen as a multichannel image with L1 channels. Then, the activation function G is applied on every element of this multichannel image separately.
Based on the above discussion, the filters of the second convolution layer must be of size M2 × N2 × L1 so that convolving a filter with the L1-channel input always yields a single channel. In addition, M2 and N2 could be any arbitrary numbers. Similarly, the output of the second convolution layer will be an L2-channel image. In terms of ConvNets, the outputs of convolution layers are called feature maps, where a feature map is the result of convolving a filter with the input of the layer. In sum, it is important to keep in mind that the convolution filters in ConvNets are mainly three-dimensional filters where the third dimension is always equal to the number of channels in the input.2

3.1.1 Role of Convolution
A bank of Gabor filters is one of the powerful methods for extracting features. The core of this method is to create a bank of N Gabor filters. Then, each filter is convolved with the input image. In this way, N different images are produced. Then, the pixels of each image are pooled to extract information from the image. We shall discuss pooling later in this chapter. There are mainly two steps in this method: convolution and pooling.
In a similar approach, ConvNets extract features based on this fundamental idea. Specifically, ConvNets apply a series of convolutions and pooling operations in order to extract features. It is noteworthy to study the role of convolution in ConvNets. For this purpose, we generated two consecutive convolution layers where the input of the first layer is an RGB image. Also, the first layer has six filters of size 7 × 7 × 3 and the second layer has one filter of size 5 × 5 × 6. This simple ConvNet is illustrated in Fig. 3.7.

Fig. 3.7 From ConvNet point of view, an RGB image is a three-channel input. The image is taken from www.flickr.com

2 In the case of video, convolution filters could be four-dimensional. But, for the scope of this book, we only mention the usual filters which are applied on images.
The input has three channels. Therefore, the third dimension of the first convolution layers has to be equal to 3. Applying each filter in the first layer on the image
will produce a single-channel image. Hence, output of the first convolution layer will
be a six-channel image. Then, because there is not activation function after the first
convolution layer, it is directly fed to the second convolution layer. Based on our
previous discussion, the third dimension of the filter in the second convolution layer
has to be 6. Since there is only one filter in the second convolution layer, the output
of this layer will be a single-channel image.
For the purpose of this example, we generated random filters for both the first and second convolution layers. Looking at the results of the first convolution layer, we observe that two filters have acted as low-pass filters (i.e., smoothing filters) and the rest of the filters have acted as high-pass filters (i.e., edge detection filters). Then, the second convolution layer has generated a single-channel image where the value at location (m, n) is obtained by linearly combining all six channels of the first convolution layer in the 5 × 5 neighborhood of location (m, n). Comparing the result of the second layer with the results of the first layer, we see that the second layer has intensified the strong edges around the cat's eyes and nose. In addition, although the edges generated by the cat's fur are stronger in the output of the first layer, they have been diminished by the second layer.
Note that the filters in the above example are just random filters. In practice, a ConvNet learns to adjust the weights of its filters such that different classes become linearly separable in the last layer of the network. This is done during the training procedure.
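To make this example concrete, the following minimal sketch (our own illustration, not the book's code) applies six random 7 × 7 × 3 filters to a random stand-in for an RGB image and then one random 5 × 5 × 6 filter to the resulting feature maps; the helper name conv_layer and the array sizes are assumptions made for this illustration.

import numpy as np
from scipy.signal import convolve2d

def conv_layer(image, filters):
    # valid convolution of a multi-channel image with a bank of 3-D filters
    maps = []
    for f in filters:                      # each f has shape (fh, fw, channels)
        channel_sum = sum(convolve2d(image[:, :, c], f[:, :, c], mode='valid')
                          for c in range(image.shape[2]))
        maps.append(channel_sum)
    return np.stack(maps, axis=2)          # output shape: (H', W', num_filters)

rgb = np.random.rand(64, 64, 3)                           # stand-in for an RGB image
layer1 = conv_layer(rgb, np.random.randn(6, 7, 7, 3))     # six feature maps
layer2 = conv_layer(layer1, np.random.randn(1, 5, 5, 6))  # single-channel output
print(layer1.shape, layer2.shape)                         # (58, 58, 6) (54, 54, 1)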


3.1.2 Backpropagation of Convolution Layers
In Sect. 2.6.1, we explained how to compute the gradient of a leaf node in a computational graph with respect to every node in the graph using a method called backpropagation. Training a convolution layer also requires the gradient of the convolution layer with respect to its parameters and to its inputs. To simplify the problem, we study backpropagation on a one-dimensional convolution layer. Figure 3.8 shows two layers from a ConvNet where the neurons of the layer on the right share the same weights and they are also locally connected. This effectively shows that the output of the second layer is obtained by convolving the weights W2 with H1.
In this graph, W = {w0, w1, w2}, wi ∈ R is the weight vector. Moreover, assume that we already know the gradient of the loss function (i.e., the leaf node in the computational graph) with respect to the computational nodes in H2. This is illustrated using δi, i = 0, . . . , 3 in the figure. According to the backpropagation algorithm, δL/δwi is given by

δL/δwi = (δH0²/δwi)(δL/δH0²) + (δH1²/δwi)(δL/δH1²) + (δH2²/δwi)(δL/δH2²) + (δH3²/δwi)(δL/δH3²)
       = (δH0²/δwi) δ0 + (δH1²/δwi) δ1 + (δH2²/δwi) δ2 + (δH3²/δwi) δ3    (3.3)

Fig. 3.8 Two layers from the middle of a neural network indicating the one-dimensional convolution. The weight W2 is shared among the neurons of H2. Also, δi shows the gradient of the loss function with respect to Hi²


By computing the above equation for each wi in the graph, we will obtain

δL/δw0 = h0 δ0 + h1 δ1 + h2 δ2 + h3 δ3
δL/δw1 = h1 δ0 + h2 δ1 + h3 δ2 + h4 δ3
δL/δw2 = h2 δ0 + h3 δ1 + h4 δ2 + h5 δ3    (3.4)

Let δ² = [δ0, δ1, δ2, δ3] denote the vector of gradients of H2 and h¹ = [h0, h1, h2, h3, h4, h5] denote the outputs of the neurons in H1. If we carefully study the above equation, we will realize that computing

h¹ ∗ δ²    (3.5)

will return δL/δW = [δL/δw0, δL/δw1, δL/δw2]. As before, the operator ∗ denotes the valid convolution operation. In general, the gradient of the loss function with respect to the convolution filters is obtained by convolving the δ of the current layer with the inputs of the layer.
Beside δL/δwi, we also need to compute δL/δhi in order to pass the error to the previous layer. According to the backpropagation algorithm, we can compute these gradients as follows:

δL/δh0 = (δH0²/δh0) δ0
δL/δh1 = (δH0²/δh1) δ0 + (δH1²/δh1) δ1
δL/δh2 = (δH0²/δh2) δ0 + (δH1²/δh2) δ1 + (δH2²/δh2) δ2
δL/δh3 = (δH1²/δh3) δ1 + (δH2²/δh3) δ2 + (δH3²/δh3) δ3
δL/δh4 = (δH2²/δh4) δ2 + (δH3²/δh4) δ3
δL/δh5 = (δH3²/δh5) δ3    (3.6)

By computing δHi²/δhj and plugging it into the above equation, we will obtain
δL/δh0 = w0 δ0
δL/δh1 = w1 δ0 + w0 δ1
δL/δh2 = w2 δ0 + w1 δ1 + w0 δ2
δL/δh3 = w2 δ1 + w1 δ2 + w0 δ3
δL/δh4 = w2 δ2 + w1 δ3
δL/δh5 = w2 δ3    (3.7)

If we carefully study the above equation, we will realize that computing

δ² ∗ flip(W)    (3.8)

will give us the gradient of the loss with respect to every node in the previous layer. Note that ∗ here refers to the full convolution and flip is a function that reverses the direction of W. In general, in a convolution layer, the gradient of the current layer with respect to the nodes in the previous layer is obtained by convolving the δ of the current layer with the reversed convolution filters.
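The short numerical sketch below (our own check, not code from the book) verifies these two rules on the one-dimensional example of Fig. 3.8. Keep in mind that ConvNet "convolution" is in fact cross-correlation, while np.convolve flips its second argument, so the flip(W) of Eq. (3.8) is absorbed by np.convolve; all variable names are illustrative.

import numpy as np

h = np.random.randn(6)        # outputs of H1 (inputs of the layer), h0..h5
w = np.random.randn(3)        # shared weights W = [w0, w1, w2]
delta = np.random.randn(4)    # gradients dL/dHj2 for the four neurons of H2

# gradients written out exactly as in Eqs. (3.4) and (3.7)
grad_w = np.array([sum(delta[j] * h[j + i] for j in range(4)) for i in range(3)])
grad_h = np.zeros(6)
for j in range(4):
    grad_h[j:j + 3] += delta[j] * w       # dHj2/dh_{j+i} = w_i

# Eq. (3.5): valid convolution (cross-correlation) of the layer input with delta
print(np.allclose(grad_w, np.convolve(h, delta[::-1], mode='valid')))   # True
# Eq. (3.8): full convolution of delta with the flipped filter
print(np.allclose(grad_h, np.convolve(delta, w, mode='full')))          # True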

3.1.3 Stride in Convolution
Given the image X ∈ R^(W×H), the convolution of a kernel f ∈ R^(P×Q) with the image is given by

(X ∗ f)(m, n) = Σ_{i=0}^{P−1} Σ_{j=0}^{Q−1} X(m + i, n + j) f(i, j),    m = 0, . . . , H − 1, n = 0, . . . , W − 1    (3.9)

The output of the above equation is a (W − P + 1) × (H − Q + 1) image where the value of each element is computed using the above equation. Technically, we say that the stride of the convolution is equal to one, meaning that the above equation is computed for every m and n in X.
As we will discuss shortly, in some cases, we might be interested in computing the
convolution with a larger stride. For example, we want to compute the convolution


of alternate pixels. In this case, we say that the stride of convolution is equal to two,
leading to the equation below
(X ∗ f)(m, n) = Σ_{i=0}^{P−1} Σ_{j=0}^{Q−1} X(m + i, n + j) f(i, j),    m = 0, 2, 4, . . . , H − 1, n = 0, 2, 4, . . . , W − 1    (3.10)
The result of the above convolution will be a ((W − P)/2 + 1) × ((H − Q)/2 + 1) image. Common values for the stride are 1 and 2, and you may rarely find a convolution layer with a stride greater than 3. In general, denoting the stride by s, the size of the output matrix will be equal to ((W − P)/s + 1) × ((H − Q)/s + 1). Note that the value of the stride and the filter size must be chosen such that (W − P)/s + 1 and (H − Q)/s + 1 become integer numbers. Otherwise, X has to be cropped so that they become integer numbers.
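As a quick aid, the small helper below (our own sketch; the function name conv_output_size is an assumption, not something from the book) evaluates this output-size formula and flags stride/filter combinations that would require cropping.

def conv_output_size(W, H, P, Q, s=1):
    # output size of a valid convolution with stride s: (W-P)/s + 1 by (H-Q)/s + 1
    if (W - P) % s != 0 or (H - Q) % s != 0:
        raise ValueError("crop the input so that (W-P)/s and (H-Q)/s are integers")
    return (W - P) // s + 1, (H - Q) // s + 1

print(conv_output_size(32, 32, 5, 5, s=1))   # (28, 28)
print(conv_output_size(15, 15, 3, 3, s=2))   # (7, 7)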

3.2 Pooling
In Sect. 3.1.1 we explained that the feature extraction based on bank of Gabor filters
is done in two steps. After convolving input image with many filters in the first step,
the second step in this method locally pools pixels to extract information.
A similar approach is also used in ConvNets. Specifically, assume a 190 × 190
image which is connected to a convolution layer containing 50 filters of size 7 × 7.
Output of the convolution layer will contain 50 feature maps of size 184 × 184 which
collectively represent a 50-channel image. From another perspective, output of this
layer can be seen as a 184 × 184 × 50 = 1,692,800 dimensional vector. Clearly, this
number of dimensions is very high.
The major goal of a pooling layer is to reduce the dimensionality of feature
maps. For this reason, it is also called downsampling. The factor by which the downsampling is done is called the stride or downsampling factor. We denote the pooling stride by s. For example, assume the 12-dimensional vector x = [1, 10, 8, 2, 3, 6, 7, 0, 5, 4, 9, 2]. Downsampling x with stride s = 2 means we pick every alternate element starting from the element at index 0, which will generate the vector [1, 8, 3, 7, 5, 9]. By doing this, the dimensionality of x is divided by s = 2 and it becomes a six-dimensional vector.
Suppose that x in the above example shows the response of a model to an input. When the dimensionality of x is reduced using downsampling, the effect of the values between the alternate pixels is ignored. For example, downsampling the vectors x1 = [1, 10, 8, 2, 3, 6, 7, 0, 5, 4, 9, 2] and x2 = [1, 100, 8, 20, 3, 60, 7, 0, 5, 40, 9, 20] with stride s = 2 will both produce [1, 8, 3, 7, 5, 9]. However, x1 and x2 may represent two different states of the input. As a result, important information might be discarded using the simple downsampling approach.
Pooling generalizes downsampling by considering the elements between alternate
pixels as well. For instance, a max pooling with stride s = 2 and size d = 2 will
downsample x as


Fig. 3.9 A pooling layer reduces the dimensionality of each feature map separately

xmax−pool = [max(1, 10), max(8, 2), max(3, 6), max(7, 0), max(5, 4), max(9, 2)]
(3.11)
that is equal to xmax−pool = [10, 8, 6, 7, 5, 9]. Likewise, max pooling x1 and x2 will produce [10, 8, 6, 7, 5, 9] and [100, 20, 60, 7, 40, 20], respectively. Contrary to the simple downsampling, we observe that max pooling does not ignore any element. Instead, it intelligently reduces the dimension of the vector taking into account the values in the local neighborhood of the current element; the size of the local neighborhood is determined by d.
Pooling feature maps of a convolution layer is done in a similar way. This is
illustrated in Fig. 3.9 where the 32 × 32 image is max-pooled with stride s = 2 and
d = 2. To be more specific, the image is divided into d × d regions every s pixels row-wise and column-wise. Each region corresponds to a pixel in the output feature
map. The value at each location in the output feature map is obtained by computing
the maximum value in the corresponding d × d region in the input feature map. In
this figure, it is shown how the value at location (7, 4) has been computed using its
corresponding d × d region.
It is worth mentioning that pooling is applied on each feature map separately. That means if the output of a convolution layer has 50 feature maps (i.e., the layer has 50 filters), the pooling operation is applied on each of these feature maps separately and produces another 50-channel feature map. However, the dimensionality of the feature maps is spatially reduced by a factor of s. For example, if the output of a convolution layer is a 184 × 184 × 50 image, it will be a 92 × 92 × 50 image after max-pooling with stride s = 2.
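The following minimal NumPy sketch (ours, not the book's implementation) applies non-overlapping max pooling with stride s = 2 and size d = 2 to a single feature map of the size used in the example above.

import numpy as np

def max_pool(feature_map, d=2, s=2):
    # slide a d x d window over the map every s pixels and keep the maximum
    H, W = feature_map.shape
    out = np.empty(((H - d) // s + 1, (W - d) // s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[i * s:i * s + d, j * s:j * s + d].max()
    return out

fmap = np.random.rand(184, 184)     # one feature map of the convolution layer above
print(max_pool(fmap).shape)         # (92, 92)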
Pooling regions may overlap with each other. For example, there is no overlap between the d × d regions when the stride is set to s = d. However, by setting s = a, a < d, each region will overlap with its surrounding regions. The pooling stride is usually set to 2 and the size of the pooling region is commonly set to 2 or 3.
As we mentioned earlier, the major goal of pooling is to reduce the dimensionality
of feature maps. As we will see shortly, this makes it possible to design a ConvNet


where the dimensionality of the feature vector in the last layer is very low. However,
the need for a pooling layer has been studied by researchers such as Springenberg et al. (2015). In this work, the authors show that a pooling layer can be replaced by
a convolution layer with convolution stride s = 2. Some ConvNets such as Aghdam
et al. (2016) and Dong et al. (2014) do not use pooling layers since their aim is to
generate a new image for a given input image.
Average pooling is an alternative to max pooling in which, instead of computing the maximum value in a region, the average value of the region is calculated. However, Scherer et al. (2010) showed that max pooling produces superior results compared with the average-pooling layer. In practice, average pooling is rarely used in middle layers. Another pooling method is called stochastic pooling (Zeiler and Fergus 2013). In this approach, a value from the region is randomly picked, where elements with higher values are more likely to be picked by the algorithm.

3.2.1 Backpropagation in Pooling Layer
Pooling layers are also part of the computational graph. However, in contrast to convolution layers, which are formulated using some parameters, the pooling layers that we mentioned in the previous section do not have trainable parameters. For this reason, we only need to compute their gradient with respect to the previous layer. Assume the one-dimensional layer in Fig. 3.10. Each neuron in the right layer computes the maximum of its inputs.
We need to compute the gradient of each neuron with respect to its inputs. This can be easily computed as follows:

Fig. 3.10 A one-dimensional max-pooling layer where the neurons in H2 compute the maximum of their inputs



δH0²/δh0 = 1 if max(h0, h1, h2) == h0, and 0 otherwise
δH0²/δh1 = 1 if max(h0, h1, h2) == h1, and 0 otherwise
δH0²/δh2 = 1 if max(h0, h1, h2) == h2, and 0 otherwise
δH1²/δh3 = 1 if max(h3, h4, h5) == h3, and 0 otherwise
δH1²/δh4 = 1 if max(h3, h4, h5) == h4, and 0 otherwise
δH1²/δh5 = 1 if max(h3, h4, h5) == h5, and 0 otherwise    (3.12)

According to the above equation, if neuron hi is selected during the max-pooling operation, the gradient from the next layer will be passed to hi. Otherwise, the gradient will not be passed to hi. In other words, if hi is not selected during max pooling, δL/δhi = 0. The gradient of stochastic pooling is also computed in a similar way. Concretely, the gradient is passed to the selected neurons and it is blocked for the other neurons. In the case of average pooling, the gradient to all input neurons is equal to 1/n, where n denotes the number of inputs of the pooling neuron.
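A tiny sketch of this gradient routing (our own illustration; the function name max_pool_backward is hypothetical) is given below: the incoming gradient delta is passed only to the input that attained the maximum.

import numpy as np

def max_pool_backward(region, delta):
    # route the incoming gradient to the winning input, zero elsewhere (Eq. 3.12)
    grad = np.zeros_like(region)
    grad[np.argmax(region)] = delta
    return grad

h = np.array([0.3, 1.7, 0.9])            # inputs h0, h1, h2 of one pooling neuron
print(max_pool_backward(h, delta=2.5))   # [0.  2.5 0. ]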

3.3 LeNet
The basic concept of ConvNets dates back to 1979 when Kunihiko Fukushima proposed an artificial neural network including simple and complex cells which were
very similar to convolution and pooling layers in modern ConvNets (Schmidhuber 2015). In 1989, LeCun et al. (1998) proposed the weight sharing paradigm and
derived convolution and pooling layers. Then, they designed a ConvNet which is
called LeNet-5. The architecture of this ConvNet is illustrated in Fig. 3.11.

Fig. 3.11 Representing LeNet-5 using a DAG


In this DAG, Ca,b shows a convolution layer with a filters of size b × b, and the phrase /a in any node shows the stride of that operation. Moreover, P/a,b denotes a pooling operation with stride a and size b, FCa shows a fully connected layer with a neurons, and Ya shows the output layer with a neurons.
This ConvNet, which was originally proposed for recognizing handwritten digits, consists of four convolution and pooling layers. The input of the ConvNet is a single-channel 32 × 32 image. Also, the last pooling layer (S4) is connected to the fully connected layer C5. The convolution layer C1 contains six filters of size 5 × 5. Convolving a 32 × 32 image with these filters produces six feature maps of size 28 × 28 (recall from the previous discussion that 32 (width) − 5 (filter width) + 1 = 28). Since the input of the network is a single-channel image, the convolution filters in C1 are actually 5 × 5 × 1 filters.
The convolution layer C1 is followed by the pooling layer S2 with stride 2. Thus, the output of S2 is six feature maps of size 14 × 14 which collectively show a six-channel input. Then, 16 filters of size 5 × 5 are applied on the six-channel image in the convolution layer C3. In fact, the size of the convolution filters in C3 is 5 × 5 × 6. As a result, the output of C3 will be 16 images of size 10 × 10 which, together, show a 16-channel input. Next, the layer S4 applies a pooling operation with stride 2 and produces 16 images of size 5 × 5. The layer C5 is a fully connected layer in which every neuron in C5 is connected to all the neurons in S4. In other words, every neuron in C5 is connected to 16 × 5 × 5 = 400 neurons in S4. From another perspective, C5 can be seen as a convolution layer with 120 filters of size 5 × 5. Likewise, F6 is also a fully connected layer that is connected to C5. Finally, the classification layer is a radial basis function layer where the inputs of the radial basis function are 84-dimensional vectors. However, for the purpose of this book, we consider the classification layer a fully connected layer composed of 10 neurons (one neuron for each digit).
The pooling operation in this particular ConvNet is not the max-pooling or
average-pooling operations. Instead, it sums the four inputs and divides them by the
trainable parameter a and adds the trainable bias b to this result. Also, the activation
functions are applied after the pooling layer and there is no activation function after
the convolution layers. In this ConvNet, the sigmoid activation functions are used.
One important question that we always have to ask is how many parameters there are in the ConvNet that we have designed. Let us compute this quantity for
LeNet-5. The first layer consists of six filters of size 5 × 5 × 1. Assuming that
each filter has also a bias term, C1 is formulated using 6 × 5 × 5 × 1 + 6 =
156 trainable parameters. Then, in this particular ConvNet, each pooling unit is
formulated using two parameters. Hence, S2 contains 12 trainable parameters. Then,
taking into account the fact that C3 is composed of 16 filters of size 5 × 5 × 6, it
will contain 16 × 5 × 5 × 6 + 16 = 2416 trainable parameters. S4 will also contain
32 parameters since each pooling unit is formulated using two parameters. In the
case of C5, it consists of 120 × 5 × 5 × 16 + 120 = 48120 parameters. Similarly,
F6 contains 84 × 120 + 84 = 10164 trainable parameters and the output includes
10 × 84 + 10 = 850 trainable parameters. Therefore, the LeNet-5 ConvNet requires
training 156 + 12 + 2416 + 32 + 48120 + 10164 + 850 = 61750 parameters.
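The short script below (our own sanity check, not code from the book; the helper name conv_params is ours) reproduces this count.

def conv_params(n_filters, h, w, channels):
    # weights plus one bias per filter
    return n_filters * h * w * channels + n_filters

layers = [
    conv_params(6, 5, 5, 1),      # C1: 156
    6 * 2,                        # S2: 12 (two trainable parameters per pooling unit)
    conv_params(16, 5, 5, 6),     # C3: 2416
    16 * 2,                       # S4: 32
    conv_params(120, 5, 5, 16),   # C5: 48120
    84 * 120 + 84,                # F6: 10164
    10 * 84 + 10,                 # output layer: 850
]
print(sum(layers))                # 61750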


Fig. 3.12 Representing AlexNet using a DAG

3.4 AlexNet
In 2012, Krizhevsky et al. (2012) trained a large ConvNet on the ImageNet dataset
(Deng et al. 2009) and won the image classification competition on this dataset. The challenge was to classify 1000 classes of natural objects. Afterwards, this ConvNet
became popular and it was called AlexNet.3 The architecture of this ConvNet is
shown in Fig. 3.12.
In this diagram, Slice illustrates a node that slices a feature map through its depth and Concat shows a node that concatenates feature maps coming from different nodes across their depth. The first convolution layer in this ConvNet contains 96 filters of size 11 × 11 which are applied with stride s = 4 on 224 × 224 × 3 images. Then, the ReLU activation is applied on the 96 feature maps. After that, a 3 × 3 max pooling with stride s = 2 is applied on the activation maps. Before applying the second convolution layer, the 96 activation maps are divided into two 48-channel maps. The
second convolution layer consists of 256 filters of size 5 × 5 × 48 in which the first
128 filters are applied on the first 48-channel map from the first layer and the second
128 filters are applied on the second 48-channel map from the first layer. The feature
maps of the second convolution layer are passed through the ReLU activation and a
max-pooling layer. The third convolution layer has 384 filters of size 3 × 3 × 256.
It turns out that each filter in the third convolution layer is connected to both
128-channel maps from the second layer. The ReLU activation is applied on the
third convolution layer, but there is no pooling after the third convolution layer. At this point, the output of the third convolution layer is divided into two 192-channel maps. The
fourth convolution layer in this ConvNet has 384 filters of size 3 × 3 × 192. As
before, the first 192 filters are connected to the first 192-channel map from the third
convolution layer and the second 192 filters are connected to the second 192-channel
map from the third convolution layer. The output of the fourth convolution layer is
passed through a ReLU activation and it directly goes into the fifth convolution layer

3 Alex is the first name of the first author.


without passing through a pooling layer. The fifth convolution layer has 256 filters of size 3 × 3 × 192 where each group of 128 filters is connected to one of the 192-channel feature maps from the fourth layer. Here, the output of this layer goes into a ReLU activation and it is passed through a max-pooling layer. Finally, there are two consecutive fully connected layers, each containing 4096 neurons and a ReLU activation, after the fifth convolution layer. The output of the ConvNet is also a fully connected layer with 1000 neurons.
AlexNet has 60,965,224 trainable parameters. Also, it is worth mentioning that there are local response normalization (LRN) layers after some of these layers. We will explain this layer later in this chapter. In short, an LRN layer does not have any trainable parameters and it applies a nonlinear transformation on feature maps. Also, it does not change any of the dimensions of the feature maps.

3.5 Designing a ConvNet
In general, one of the difficulties in neural networks is finding a good architecture which produces accurate results and is computationally efficient. In fact, there is no golden rule for finding such an architecture. Even people with years of experience in neural networks may require many trials to find a good architecture.
Arguably, the practical way is to immediately start with an architecture, implement it, and train it on the training set. Then, the ConvNet is evaluated on the validation set. If the results are not satisfactory, we change the architecture or the hyperparameters of the ConvNet and repeat the aforementioned procedure. This approach is illustrated
in Fig. 3.13.
In the rest of this section, we thoroughly explain each of these steps.

Fig. 3.13 Designing a
ConvNet is an iterative
process. Finding a good
architecture may require
several iterations of
design–implement–evaluate


3.5.1 ConvNet Architecture
Although there is no golden rule for designing a ConvNet, there are a few rules of thumb that can be found in many successful architectures. A ConvNet typically consists of several convolution-pooling layers followed by a few fully connected layers. Also, the last layer is always the output layer. From another perspective, a ConvNet is a directed acyclic graph (DAG) with one leaf node. In this DAG, each node represents a layer and edges show the connections between layers.
A convenient way of designing a ConvNet is to use a DAG diagram like the one
illustrated in Fig. 3.12. One can define other nodes or combine several nodes into one
node. You can design any DAG to represent a ConvNet. However, two rules have to be followed. First, there is always one leaf node in a ConvNet, which represents the classification layer or the loss function. It does not make sense to have more than one classification layer in a ConvNet. Second, the inputs of a node must have the same spatial dimension. The exception could be the concatenation node where you can also concatenate the inputs spatially. As long as these two rules are observed in
the DAG, the architecture is valid and correct. Also, all operations represented by
nodes in the DAG must be differentiable so that the backpropagation algorithm can
be applied on the graph.
As the first rule of thumb, remember to always compute the size of feature maps
for each node in the DAG. Usually, nodes that are connected to fully connected
layers have spatial size less than 8 × 8. Common sizes are 2 × 2, 3 × 3, and 4 × 4.
However, the channels (third dimension) of the nodes connecting to fully connected
layer could be any arbitrary size. The second rule of thumb is that the number of feature maps usually has a direct relation with the depth of each node in the DAG. That means we start with a small number of feature maps in the early layers and the number of feature maps increases as the depth of the nodes increases. However, some flat
architectures have been also proposed in literature where all layers have the same
number of feature maps or they have a repetitive pattern.
The third rule of thumb is that state-of-the-art ConvNets commonly use convolution filters of size 3 × 3, 5 × 5, and 7 × 7. Among them, AlexNet is the only one
that has utilized 11 × 11 convolution filters. The fourth rule of thumb is that activation functions usually come immediately after a convolution layer. However, there are a few works that put the activation function after the pooling layer. As the fifth rule of thumb, remember that while putting several convolution layers consecutively makes sense, it is not common to add two or more consecutive activation function layers. The sixth rule of thumb is to use an activation function from the family of ReLU functions (ReLU, Leaky ReLU, PReLU, ELU or Noisy ReLU). Also, always compute
the number of trainable parameters of your ConvNet. If you do not have plenty of
data and you design a ConvNet with millions of parameters, the ConvNet might not
generalize well on the test set.
In the simplest scenario, the idea in Fig. 3.13 might be just designing a ConvNet.
However, there are many other points that must be considered. For example, we may
need to preprocess the data. Also, we have to split the data into several parts. We
shall discuss this later in this chapter. For now, let us just assume that the idea


refers to designing a ConvNet. Having the idea defined clearly, the next step is to
implement the ideas.

3.5.2 Software Libraries
The main scope of this book is ConvNets. For this reason, we will only discuss how to efficiently implement a ConvNet in a practical application. Other steps such as preprocessing data or splitting data might be done in any programming language. There are several commonly used libraries for implementing ConvNets which are actively updated as new methods and ideas are developed in this field. There have been other libraries such as cuda-convnet which are not active anymore. Also, there are many other libraries in addition to the following list. But, the following list is widely used in academia as well as industry:
• Theano (deeplearning.net/software/theano/)
• Lasagne (lasagne.readthedocs.io/en/latest/)
• TensorFlow (www.tensorflow.org/)
• Keras (keras.io/)
• Torch (torch.ch/)
• cuDNN (developer.nvidia.com/cudnn)
• mxnet (mxnet.io)
• Caffe (caffe.berkeleyvision.org/)

3.5.2.1 Theano
Theano is a library which can be used in Python for symbolic numerical computations. In this library, a computational DAG such as a ConvNet is defined using symbolic expressions. Then, the symbolic expressions are compiled using its built-in compiler into executable functions. These functions can be called similar to other functions in Python. There are two features in Theano which are very important. First, based on the user configuration, compiling the functions can be done either on CPUs or on a GPU. Even a user with little knowledge about GPU programming can easily use this library for running heavy expressions on a GPU.
Second, Theano represents any expression in terms of computational graphs. For this reason, it is also able to compute the gradient of a leaf node with respect to all other nodes in the graph automatically. Consequently, users can easily implement gradient-based optimization algorithms to train a ConvNet. Gradients of convolution and pooling layers are also computed efficiently using Theano. These features make Theano a good choice for doing research. However, it might not be easily utilized in commercial products.
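As a flavor of these two features, the minimal sketch below (our own toy example, not code from the book) declares a symbolic expression, lets Theano derive its gradient automatically, and compiles both into a callable function.

import theano
import theano.tensor as T

x = T.dvector('x')                      # symbolic input vector
loss = T.sum(x ** 2)                    # a toy scalar "loss": sum of squares
grad = T.grad(loss, wrt=x)              # symbolic gradient, derived automatically

f = theano.function([x], [loss, grad])  # compiled for CPU or GPU per the configuration
print(f([1.0, 2.0, 3.0]))               # [array(14.0), array([2., 4., 6.])]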


3.5.2.2 Lasagne
Despite its great power, Theano is a low-level library. For example, every time you need to design a convolution layer followed by the ReLU activation, you must write code for each part separately. Lasagne has been built on top of Theano and it implements the common patterns in ConvNets so you do not need to implement them every time. In fact, Lasagne can only be used for designing neural networks, including ConvNets. Nonetheless, to use Lasagne one must have basic knowledge about Theano as well.

3.5.2.3 TensorFlow
TensorFlow is another library for numerical computations. It has interfaces for both
Python and C++. Similar to Theano it expresses mathematical equations in terms of
a DAG. It supports automatic differentiation as well. Also, it can compile the DAG
on CPUs or GPUs.

3.5.2.4 Keras
Keras is a high-level library which is written in Python. Keras is able to run on top of either TensorFlow or Theano depending on the user configuration. Using Keras, it is possible to rapidly develop your idea and train it. Note that it is also possible to rapidly develop ideas in Theano and TensorFlow.

3.5.2.5 Torch
Torch is also a library for scientific computing and it supports ConvNets. It is based
on Lua programming language and uses the scripting language LuaJIT. Similar to
other libraries it supports computations on CPUs and GPUs.

3.5.2.6 cuDNN
cuDNN has been developed by NVIDIA and it can only be used for implementing deep neural networks on GPUs created by NVIDIA. Also, it supports forward and backward propagation. Hence, not only can it be used in developing products, but it can also be used for training ConvNets. cuDNN is a great choice for commercial products. In fact, all other libraries in our list use cuDNN for compiling their code on GPUs.

3.5.2.7 mxnet
Another commonly used library is called mxnet.4 Similar to Theano, TensorFlow, and Torch, it supports automatic differentiation and symbolic expressions. It also supports

4 mxnet.io.


Fig. 3.14 A dataset is usually partitioned into three different parts namely training set, development
set and test set

distributed computing which is very useful in the case that you want to train a model
on several GPUs.

3.5.2.8 Caffe
Caffe is the last library in our list. It is written in C++ and it has interfaces for MATLAB and Python. It can only be used for developing deep neural networks. Also, it supports all the state-of-the-art methods proposed in the community for ConvNets. It can be used both for research and commercial products. However, developing new layers in Caffe is not as easy as in Theano, TensorFlow, mxnet, or Torch. But, creating a ConvNet to solve a problem can be done quickly and effectively. More importantly, the trained ConvNet can be easily ported to embedded systems. We will develop all our ConvNets using the Caffe library. In the next chapter, we will explain how to design, train, and test a ConvNet using the Caffe library.
There are also other libraries such as Deeplearning4j (deeplearning4j.org), Microsoft Cognitive Toolkit (www.microsoft.com/en-us/research/product/cognitive-toolkit/), Pylearn2 (github.com/lisa-lab/pylearn2) and MatConvNet (www.vlfeat.org/matconvnet). But, the above list is more common in academia.

3.5.3 Evaluating a ConvNet
After implementing your idea using one of the libraries in the previous section, it is time to evaluate how good the idea is for solving the problem. Concretely, the evaluation must be done empirically using a dataset. In practice, evaluation is done using three different partitions of the data. Assume the dataset X = {(x1, y1), . . . , (xn, yn)} containing n samples, where xi ∈ R^(W×H×3) is a color image and yi ∈ {1, . . . , c} is its corresponding class label. This dataset must be partitioned into three disjoint sets, namely the training set, the development set and the test set, as illustrated in Fig. 3.14.
Formally, the dataset X is partitioned into Xtrain, Xdev and Xtest such that

X = Xtrain ∪ Xdev ∪ Xtest    (3.13)


and

Xtrain ∩ Xdev = Xtrain ∩ Xtest = Xdev ∩ Xtest = ∅.    (3.14)

The training set will be used only for training (i.e., minimizing the loss function of) the ConvNet. During the training of the ConvNet, its performance is regularly
evaluated on the development set. If the performance is not acceptable, we go back
to the idea and refine the idea or design a new idea from scratch. Then, the new
idea will be implemented and trained on the same training set. Then, it is evaluated
on the development set. This procedure will be repeated until we are happy with
the performance of model on the development set. After that, we carry out a final
evaluation using the test set. The performance on the test set will tell us how good
our model will be in real world. It is worth mentioning that the development set is
commonly called validation set. In this book, we use validation set and development
set interchangeably.
Splitting the data into three partitions is a very important step toward developing a good and reliable model. We should note that evaluation on the test set is done
only once. We never try to refine our model based on the performance of the test
set. Instead, if we see that the performance on the test set is not acceptable and we
need to develop a new idea, the new idea will be refined and evaluated only on the
training and development sets. The test set will be only used to ascertain whether or
not the model is good for the real-world application. If we refine the idea based on
performance of test set rather than the development set we may end up with a model
which might not yield accurate results in practice.

3.5.3.1 Classification Metrics
Evaluating a model on the development set or the test set can be done using classification metric functions or simply a metric function. On the one hand, the output of
a ConvNet which is trained for a classification task is the label (class) of its input.
We call the label produced by a ConvNet predicted label. On the other hand, we also
know the actual label of each sample in Xdev and Xtest . Therefore, we can use the
predicted labels and actual labels of samples in these sets to assess our ConvNet.
Mathematically, a classification metric function usually accepts the actual labels and the predicted labels and returns a score or a set of scores. The following metric functions can be applied on any classification dataset regardless of whether it is a training set, a development set or a test set.

3.5.3.2 Classification Accuracy
The simplest metric function for the task of classification is the classification accuracy. It calculates the fraction of samples that are classified correctly. Given the set X′ = {(x1, y1), . . . , (xN, yN)} containing N pairs of samples, the classification score


is computed as follows:

accuracy = (1/N) Σ_{i=1}^{N} 1[yi == ŷi]    (3.15)

where yi and ŷi are the actual label and the predicted label of the ith sample in X′. Also, 1[.] returns 1 when the value of its argument evaluates to True and 0 otherwise. Clearly, accuracy will take a value in [0, 1]. If the accuracy is equal to 1, that means all the samples in X′ are classified correctly. In contrast, if the accuracy is equal to 0, that means none of the samples in X′ is classified correctly.
Computing accuracy is straightforward and it is commonly used for assessing classification models. However, accuracy possesses one serious limitation. We explain this limitation using an example. Assume the set X′ = {(x1, y1), . . . , (x3000, y3000)} with 3000 samples, where yi ∈ {1, 2, 3} shows that each sample in this dataset belongs to one of three classes. Suppose 1500 samples within X′ belong to class 1, 1400 samples belong to class 2 and 100 samples belong to class 3. Further assume that all samples belonging to classes 1 and 2 are classified correctly but all samples belonging to class 3 are classified incorrectly. In this case, the accuracy will be equal to 2900/3000 = 0.9666, showing that 96.66% of the samples in X′ are classified correctly. If we only look at the accuracy, we might think that 96.66% is very accurate for our application and we decide that our ConvNet is finalized.
However, the accuracy in the above example is high because the number of samples belonging to class 3 is much less than the number of samples belonging to class 1 or 2. In other words, the set X′ is imbalanced. To alleviate this problem, we can set a weight for each sample where the weight of a sample in class A is inversely proportional to the number of samples in class A. Based on this formulation, the weighted accuracy is given by
accuracy = Σ_{i=1}^{N} wi × 1[yi == ŷi]    (3.16)

where wi denotes the weight of the ith sample. If there are C classes in X′, the weight of a sample belonging to class A is usually equal to

1 / (C × number of samples in class A).    (3.17)

In the above example, the weights of the samples of class 1 will be equal to 1/(3 × 1500) = 0.00022 and the weights of the samples of class 2 will be equal to 1/(3 × 1400) = 0.00024. Similarly, the weight of the samples of class 3 will be equal to 1/(3 × 100) = 0.0033. Computing the weighted accuracy in the above example, we will obtain 1500 × 1/(3 × 1500) + 1400 × 1/(3 × 1400) + 0 × 1/(3 × 100) = 0.6666 instead of 0.9666. The weighted accuracy gives us a better estimate of performance in this particular case.
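The snippet below (our own sketch) reproduces this calculation for the hypothetical 3000-sample set; the labels and weights are constructed exactly as in the example.

import numpy as np

y_true = np.array([1] * 1500 + [2] * 1400 + [3] * 100)
y_pred = np.array([1] * 1500 + [2] * 1400 + [1] * 100)   # class 3 is always misclassified

accuracy = np.mean(y_true == y_pred)
print(round(accuracy, 4))                                # 0.9667

C = 3
class_size = {c: np.sum(y_true == c) for c in (1, 2, 3)}
weights = np.array([1.0 / (C * class_size[c]) for c in y_true])   # Eq. (3.17)
weighted_accuracy = np.sum(weights * (y_true == y_pred))          # Eq. (3.16)
print(round(weighted_accuracy, 4))                                # 0.6667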


There is still another limitation with the accuracy metric, even on perfectly balanced datasets. Assume that there are 200 different classes in X′ and there are 100 samples in each class, yielding 20,000 samples in X′. Assume all the samples belonging to classes 1 to 199 are classified correctly and all of the samples belonging to class 200 are classified incorrectly. In this case, the accuracy score will be equal to 19900/20000 = 0.995, showing a nearly perfect classification. Even with the above weighting approach, the accuracy will still be equal to 0.995.
In general, the accuracy score gives a rough evaluation of the model and it might not be a reliable metric for making final decisions about a model. However, the above examples are hypothetical and it may never happen in practice that all the samples from one class are classified incorrectly while all the samples from the other classes are classified correctly. The above hypothetical example is just to show the limitation of this metric. In practice, the accuracy score is commonly used for assessing models. But, great care must be taken when you are evaluating your model using the classification accuracy.

3.5.3.3 Confusion Matrix
The confusion matrix is a powerful metric for accurately evaluating classification models. For a classification problem with C classes, the confusion matrix M is a C × C matrix where the element Mij shows the number of samples in X′ whose actual class label is i but which are classified as class j by our ConvNet. Concretely, Mii shows the number of samples which are correctly classified. We first study the confusion matrix on binary classification problems. Then, we will extend it to multiclass classification problems. There are only two classes in binary classification problems. Consequently, the confusion matrix will be a 2 × 2 matrix. Figure 3.15 shows the confusion matrix for a binary classification problem.
Element M11 in this matrix shows the number of samples whose actual labels are 1
and they are classified as 1. Technically, this element of matrix is called true-positive
(TP) samples. Element M12 shows the number of samples whose actual label is 1 but
they are classified as −1. This element is called false-negative (FN) samples. Element
M21 denotes the number of samples whose actual label is −1 but they are classified
as 1. Hence, this element is called false-positive (FP) samples. Finally, element M22 ,
that is called true-negative (TN) samples, illustrates the number of samples which

Fig. 3.15 For a binary
classification problem,
confusion matrix is a 2 × 2
matrix


Fig. 3.16 Confusion matrix
in multiclass classification
problems

are actually −1 and they are classified as −1. Based on this formulation, the accuracy
is given by:
accuracy = (TP + TN) / (TP + TN + FP + FN)    (3.18)

Concretely, a ConvNet is a perfect classifier if FP = FN = 0. The confusion matrix
can be easily extended to multiclass classification problem. For example, Fig. 3.16
shows a confusion matrix for five-class classification problems. A ConvNet for this
matrix is a perfect classifier if all non-diagonal elements of this matrix are zero. The
terms TP, FP, and FN can be extended to this confusion matrix as well.
For any class i in this matrix,

FNi = Σ_{j≠i} Mij    (3.19)

returns the number of false-negative samples for the ith class and

FPi = Σ_{j≠i} Mji    (3.20)

returns the number of false-positive samples for the ith class. In addition, the accuracy is given by

accuracy = Σ_i Mii / Σ_i Σ_j Mji.    (3.21)
Studying the confusion matrix tells us how good our model is in practice. Using this matrix, we can see which classes cause trouble in classification. For example, if M33 is equal to 100 and M35 is equal to 80 and all other elements in the same row are zero, this shows that the classifier makes mistakes by classifying samples belonging to class 3 as class 5, and it does not make any mistakes with the other classes in the same row. A similar analysis can be done on the columns of a confusion matrix.


In general, the confusion matrix is a very powerful tool for assessing a classifier. But, it might be tedious or even impractical to analyze a confusion matrix for a 250-class classification problem. Making sense of a large confusion matrix is a hard task and is sometimes nearly impossible. For this reason, we usually extract some quantitative measures from the confusion matrix which are more reliable and informative compared with the accuracy score.

3.5.3.4 Precision and Recall
Precision and recall are two important quantitative measures for assessing a classifier. Precision computes the fraction of predicted positives that are actually positive, and recall computes the fraction of actual positives that are predicted as positive. To be more specific, precision is given by

precision = TP / (TP + FP)    (3.22)

and recall is computed by

recall = TP / (TP + FN).    (3.23)

Obviously, FP and FN must be zero in a perfect classifier, leading to precision and recall scores equal to 1. If precision and recall are both equal to 1, we can say that the classifier is perfect. If these quantities are close to zero, we can imply that the classifier is very inaccurate. Computing precision and recall on a binary confusion matrix is trivial. In the case of a multiclass classification problem, the precision of the ith class is given by

precision_i = Mii / (Mii + Σ_{j≠i} Mji) = Mii / Σ_j Mji    (3.24)

and the recall of the ith class is given by

recall_i = Mii / (Mii + Σ_{j≠i} Mij) = Mii / Σ_j Mij.    (3.25)

Considering that there are C classes, the overall precision and recall of a confusion matrix can be computed as follows:

precision = Σ_{i=1}^{C} wi × precision_i    (3.26)

recall = Σ_{i=1}^{C} wi × recall_i.    (3.27)

If we set wi to 1/C, the above equations will simply compute the average of the precisions and recalls in a confusion matrix. However, if wi is equal to the number of samples in the ith class divided by the total number of samples, the above equations will compute the weighted average of precisions and recalls, taking into account an imbalanced dataset. Moreover, you may also compute the variance of precisions and recalls beside the weighted mean in order to see how much these values fluctuate in the confusion matrix.

3.5.3.5 F1 Score
While precision and recall are very informative and useful for assessing a ConvNet,
in practice, we are usually interested in designing and evaluating ConvNets based on
a single quantity. One effective way to achieve this goal is to combine the values of
precision and recall. This can be simply done by computing the average of precision and recall. However, the arithmetic average of these two quantities is not very informative, since it can still be high when one of the two quantities is very low. Instead, we can compute the harmonic mean of precision and recall, which penalizes such imbalanced cases, as follows:
F1 = 2 / (1/precision + 1/recall) = 2TP / (2TP + FP + FN)    (3.28)

This harmonic mean is called the F1-score, which is a number in [0, 1], with an F1-score equal to 1 showing a perfect classifier. In the case of multiclass classification problems, the F1-score can be simply computed by taking the weighted average of class-specific F1-scores (similar to the method we used for precision and recall in the previous section).
The F1-score is a reliable and informative quantity for evaluating a classifier. In practice, we usually evaluate the implemented ideas (ConvNets) using the F1-score on the development set and refine the idea until we get a satisfactory F1-score. Then, a complete analysis can be done on the test set using the confusion matrix and its related metrics.
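The compact sketch below (our own illustration with made-up labels; the helper name confusion_matrix is an assumption) builds a confusion matrix and derives per-class precision, recall, and F1 following Eqs. (3.24), (3.25) and (3.28).

import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    # rows: actual class, columns: predicted class
    M = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        M[t, p] += 1
    return M

y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]
M = confusion_matrix(y_true, y_pred, 3)

precision = np.diag(M) / M.sum(axis=0)      # M_ii / sum_j M_ji
recall = np.diag(M) / M.sum(axis=1)         # M_ii / sum_j M_ij
f1 = 2 * precision * recall / (precision + recall)
print(M)
print(precision, recall, f1)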

3.6 Training a ConvNet
Training a ConvNet can be done in several ways. In this section, we will explain best
practices for training a ConvNet. Assume the training set Xtrain . We will use this set
for training the ConvNet. However, Coates and Ng (2012) showed that preprocessing
data is helpful for training a good model. In the case of ConvNets applied on images,
we usually compute the mean image using the samples in Xtrain and subtract it
from each sample in the whole dataset. Formally, the mean image is obtained by
computing

x̄ = (1/N) Σ_{xi ∈ Xtrain} xi.    (3.29)


Then, each sample in the training set as well as the development set and the test set is replaced by

xi = xi − x̄    ∀xi ∈ Xtrain
xi = xi − x̄    ∀xi ∈ Xdev
xi = xi − x̄    ∀xi ∈ Xtest    (3.30)

Note that the mean image is only computed on the training set, but it is used to preprocess the development and test sets as well. Subtracting the mean is very common and helpful in practice. It translates the whole dataset such that its expected value (mean) is located very close to the origin of the image space. In the case of neural networks designed with hyperbolic tangent activation functions, subtracting the mean from the data is crucial since it guarantees that the activations of the first layer will be close to zero and the gradients of the network will be close to one. Hence, the network will be able to learn from the data.
To further preprocess the dataset, we can compute the variance of every element
of xi . This can be easily obtained by computing
var(Xtrain) = (1/N) Σ_{i=1}^{N} (xi − x̄)²    (3.31)

where N is the total number of samples in the training set. The square and division operations in the above equation are applied in an elementwise fashion. Assuming that xi ∈ R^(H×W×3), var(Xtrain) will have the same size as xi. Then, (3.30) can be written as

xi = (xi − x̄) / √var(Xtrain)    ∀xi ∈ Xtrain
xi = (xi − x̄) / √var(Xtrain)    ∀xi ∈ Xdev
xi = (xi − x̄) / √var(Xtrain)    ∀xi ∈ Xtest    (3.32)

Beside translating the dataset to the origin, the above transformation also changes the variance of each element of the input so that it will be equal to 1 for each element. This preprocessing technique is commonly known as mean-variance normalization. As before, computing the variance is only done using the data in the training set and it is used for transforming the data in the development and test sets as well.
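A brief sketch of this preprocessing (our own, with random arrays standing in for the images) is shown below; note that the statistics come from the training set only.

import numpy as np

x_train = np.random.rand(1000, 32, 32, 3)    # stand-in for the training images
x_dev = np.random.rand(200, 32, 32, 3)
x_test = np.random.rand(200, 32, 32, 3)

mean = x_train.mean(axis=0)                  # elementwise mean image, Eq. (3.29)
std = np.sqrt(x_train.var(axis=0))           # elementwise variance, Eq. (3.31)

x_train = (x_train - mean) / std             # Eq. (3.32), applied to all three sets
x_dev = (x_dev - mean) / std
x_test = (x_test - mean) / std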

3.6.1 Loss Function
Two commonly used loss functions for training ConvNets are the multiclass version
of the logistic loss function and the multiclass hinge loss function. It is also possible
to define a loss function which is equal to a weighted sum of several loss functions.


However, this is not a common approach and we usually train a ConvNet using only one loss function. These two loss functions are thoroughly explained in Chap. 2.

3.6.2 Initialization
Training a ConvNet successfully using a gradient-based method without a good initialization is nearly impossible. In general, there are two sets of parameters in a ConvNet, namely weights and biases. We usually set all the biases to zero. In this section, we will describe a few techniques for initializing the weights that have produced promising results in practice.

3.6.2.1 All Zero
The trivial method for initializing the weights is to set all of them to zero. However, this will not work, since all neurons will then produce the same signal during backpropagation and the weights will be updated using exactly the same rule. This means that the ConvNet will not be trained properly.

3.6.2.2 Random Initialization
A better idea is to initialize the weights randomly. The random values might be drawn from a Gaussian distribution or a uniform distribution. The idea is to generate small random numbers. To this end, the mean of the Gaussian distribution is usually fixed at 0 and its variance is fixed at a value such as 0.001. Alternatively, it is also possible to generate random numbers from a uniform distribution where the minimum and maximum values of the distribution are fixed at numbers close to zero such as ±0.001. Using this technique, each neuron will produce a different output in the forward pass. As a result, the update rule of each neuron will be different from the other neurons and the ConvNet will be trained properly.

3.6.2.3 Xavier Initialization
As illustrated in Sutskever et al. (2013) and Mishkin and Matas (2015), initialization has a great influence on training a ConvNet. Glorot and Bengio (2010) proposed an initialization technique which has been one of the most successful methods of initialization so far. This initialization is widely known as Xavier initialization.9 As we saw in Chap. 2, the output of a neuron in a neural network is given by

z = w1 x1 + · · · + wd xd    (3.33)

9 Xavier is the first name of the first author.


where xi ∈ R and wi ∈ R are the ith input and its corresponding weight. If we compute the variance of z, we will obtain

Var(z) = Var(w1 x1 + · · · + wd xd).    (3.34)

Taking into account the properties of variance, the above equation can be decomposed to

Var(z) = Σ_{i=1}^{d} Var(wi xi) + Σ_{i≠j} Cov(wi xi, wj xj).    (3.35)

In the above equation, Cov(.) denotes the covariance. Using the properties of variance, the first term in this equation can be decomposed to

Var(wi xi) = E[wi]² Var(xi) + E[xi]² Var(wi) + Var(wi)Var(xi)    (3.36)

where E[.] denotes the expected value of a random variable. Assuming that mean-variance normalization has been applied on the dataset, the second term in the above equation will be equal to zero since E[xi] = 0. Consequently, it will be reduced to

Var(wi xi) = E[wi]² Var(xi) + Var(wi)Var(xi).    (3.37)

Suppose we want the expected value of the weights to be equal to zero. In that case, the above equation will be reduced to

Var(wi xi) = Var(wi)Var(xi).    (3.38)

By plugging the above equation into (3.35), we will obtain

Var(z) = Σ_{i=1}^{d} Var(wi)Var(xi) + Σ_{i≠j} Cov(wi xi, wj xj).    (3.39)

Assuming that wi and xi are independent and identically distributed, the second term
in the above equation will be equal to zero. Also, we can assume that V ar(wi ) =
V ar(wj ), ∀i, j. Taking into account these two conditions, the above equation will be
simplified to
Var(z) = d × Var(wi)Var(xi).    (3.40)

Since the inputs have been normalized using mean-variance normalization, Var(xi) will be equal to 1. Then

Var(wi) = 1/d    (3.41)

where d is the number of inputs to the current layer. The above equation tells us that
the weights of current layer can be initialized using the Gaussian distribution with


mean equal to zero and variance equal to 1/d. This technique is the default initialization technique in the Caffe library. Glorot and Bengio (2010) carried out a similar analysis on the backpropagation step and concluded that the current layer can also be initialized by setting the variance of the Gaussian distribution to

Var(wi) = 1/nout    (3.42)

where nout is the number of outputs of the layer. Later, He et al. (2015) showed that for a ConvNet with ReLU layers, the variance can be set to

Var(wi) = 2/(nin + nout)    (3.43)

where nin = d is the number of inputs to the layer. Despite many simplifying assumptions, all three techniques for determining the value of the variance work very well with ReLU activations.
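A small sketch of these three rules (ours; the function name init_weights and the rule labels are illustrative) is given below: the weights are drawn from a zero-mean Gaussian whose variance is 1/nin, 1/nout, or 2/(nin + nout).

import numpy as np

def init_weights(n_in, n_out, rule='fan_in'):
    variance = {'fan_in': 1.0 / n_in,              # Eq. (3.41)
                'fan_out': 1.0 / n_out,            # Eq. (3.42)
                'averaged': 2.0 / (n_in + n_out)   # Eq. (3.43)
                }[rule]
    return np.random.randn(n_out, n_in) * np.sqrt(variance)

W = init_weights(400, 120, rule='averaged')
print(W.shape, W.var())    # empirical variance close to 2 / (400 + 120)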

3.6.3 Regularization
So far in this book, we have explained how to design, train, and evaluate a ConvNet. In this section, we bring up another topic which has to be considered in training a ConvNet. Assume the binary dataset illustrated in Fig. 3.17. The blue solid circles and the red dashed circles show the training data of two classes. Also, the dash-dotted red circle is the test data.
This figure shows how the space might be divided into two regions if we fit a linear classifier on the training data. It turns out that the small solid blue circle has been ignored during training because any line that classifies this circle correctly will have

Fig. 3.17 A linear model is
highly biased toward data
meaning that it is not able to
model nonlinearities in the
data


Fig. 3.18 A nonlinear model
is less biased but it may
model any small nonlinearity
in data

Fig. 3.19 A nonlinear model
may still overfit on a training
set with many samples

a higher loss compared with the line in this figure. If the linear model is evaluated using the test data, it will perfectly classify all its samples.
However, suppose that we have created a feedforward neural network with one hidden layer in order to make this dataset linearly separable. Then, a linear classifier is fitted on the transformed data. As we saw earlier, this is equivalent to a nonlinear decision boundary in the original space. The decision boundary may look like Fig. 3.18.
Here, we see that the model is able to perfectly distinguish the training samples. However, if it is assessed using the test set, none of the samples in this set will be classified correctly. Technically, we say the model is overfitted on the training data and it has not generalized to the test set. The obvious cure for this problem seems to be gathering more data. Figure 3.19 illustrates a scenario where the size of the training set is large.
Clearly, the system works better, but it still classifies most of the test samples incorrectly. One reason is that the feedforward neural network may have many neurons


in the hidden layer and, hence, it is a highly nonlinear function. For this reason, it is
able to model even small nonlinearities in the dataset. In contrast, a linear classifier
trained on the original data is not able to model nonlinearities in the data.
In general, if a model is highly nonlinear and it is able to learn any small nonlinearity in the data, we say that the model has high variance. In contrast, if a model is not able to learn nonlinearities in the data, we say it has a high bias. A model with high variance is prone to overfitting on the data, which can adversely reduce the accuracy on the test set. On the contrary, a highly biased model is not able to deal with nonlinear datasets. Therefore, it is not able to accurately learn from the data.
The important point in designing and training is to find a trade-off between model bias and model variance. But, what causes a neural network to have high variance/bias? This mainly depends on two factors: the number of neurons/layers in a neural network and the magnitude of the weights. Concretely, a neural network with many neurons/layers is capable of modeling very complex functions. As the number of neurons increases, its ability to model highly nonlinear functions increases as well. In contrast, by reducing the number of neurons, the ability of a neural network for modeling highly nonlinear functions decreases.
A highly nonlinear function has different characteristics. One of them is that
a highly nonlinear function is differentiable several times. In the case of neural
networks with sigmoid activation, the neural networks are infinitely differentiable.
A neural network with ReLU activations could be also differentiable several times.
Assume a neural network with sigmoid activations. If we compute the derivative of
output with respect to its input, it will depend on the values of weights. Since the
neural network is differentiable several times (infinitely in this case), the derivative
is also a nonlinear function.
It turns out that the derivative of the function for a given input will be higher if the weights are also higher. As the magnitude of the weights increases, the neural network becomes more capable of modeling sudden variations. For example, Fig. 3.20 shows the
two decision boundaries generated by a feedforward neural network with four hidden
layers.
The neural network is initialized with random numbers between −1 and 1. The decision boundary associated with these values is shown on the left. The decision boundary in the right plot is obtained using the same neural network; we have only multiplied the weights of the third layer by 10 in the right plot. As we can see, the decision boundary in the left plot is smooth, but the decision boundary in the right plot is spiky with sharp changes. These sharp changes sometimes cause a neural network to overfit on the training set.
For this reason, we have to keep the magnitude of weights close to zero in order to
control the variance of our model. This is technically called regularization and it is
an important step in training a neural network. There are different ways to regularize
a neural network. In this section, we will only explain the methods that are already
implemented in the Caffe library.


3.6.3.1 L2 Regularization
Let us denote the weights of all layers in a neural network by W. A simple but effective way for regularizing a neural network is to compute the L2 norm of the weights and add it to the loss function. This regularization technique is called L2 regularization. Formally, instead of minimizing L(x), we define the loss function as

Ll2(x) = L(x) + λ‖W‖²    (3.44)

where ‖W‖² is the squared L2 norm of the weights and λ is a user-defined value showing how much the regularization term can penalize the loss function. The regularization term will be minimized when all the weights are zero. Consequently, the second term encourages the weights to have small values. If λ is high, the weights will be very close to zero, which means we reduce the variance of our model and increase its bias. In contrast, if λ is small, we let the weights take higher values. Therefore, the variance of the model increases.
A nice property of L2 regularization is that it does not produce spiky weights where a few of the weights might be much higher than the other weights. Instead, it distributes the weights evenly so the weight vector is smooth.

3.6.3.2 L1 Regularization
Instead of the L2 norm, L1 regularization penalizes the loss function using the L1 norm of the weight vector. Formally, the penalized loss function is given by

Ll1(x) = L(x) + λ|W|    (3.45)

where |W| is the L1 norm of the weights and λ is a user-defined value that has the same effect as in L2 regularization. In contrast to L2 regularization, L1 regularization

Fig. 3.20 A neural network with larger weights is capable of modeling sudden changes in the
output. The right decision boundary is obtained by multiplying the weights of the third layer of the
neural network on the left by 10


can produce sparse weight vectors in which some of the weights are very close
to or exactly zero. However, this property is not guaranteed if we optimize the
L1-regularized loss function using the gradient descent algorithm.
From another perspective, L1 regularization selects features that are useful for the
classification task at hand. This is done by pushing the weights of irrelevant features
toward zero. However, if there is no need for feature selection, L2 regularization is
preferred over L1 regularization. It is also possible to combine L2 and L1 regularization
and obtain

L_l1l2(x) = L(x) + λ₁|W| + λ₂‖W‖².        (3.46)

The above regularization is called the elastic net. However, training a ConvNet using
this combined regularization is not common; in practice we mainly use L2
regularization.
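As a small illustration of (3.45) and (3.46) (again only a sketch with assumed names, not the way any library implements it), the snippet below computes the L1 and elastic-net penalties and the contribution of the L1 term to the gradient; because that contribution is λ·sign(W), it pushes small weights toward exactly zero, which is the source of the sparsity mentioned above.

import numpy as np

def l1_penalty(W, lam):
    # lambda * |W| (sum of absolute values)
    return lam * np.sum(np.abs(W))

def elastic_net_penalty(W, lam1, lam2):
    # lambda1 * |W| + lambda2 * ||W||^2
    return lam1 * np.sum(np.abs(W)) + lam2 * np.sum(W ** 2)

def l1_subgradient(W, lam):
    # the L1 term adds lambda * sign(W) to the gradient of the loss
    return lam * np.sign(W)

W = np.array([0.5, -0.01, 2.0])
print(l1_penalty(W, 0.1), elastic_net_penalty(W, 0.1, 0.01), l1_subgradient(W, 0.1))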

3.6.3.3 Max-Norm Regularization
The previous two regularization methods are applied by adding a penalizing term
to the loss function. Max-norm regularization does not penalize the loss function.
Instead, it always keeps ‖W‖ within a ball of radius c. Formally, after computing the
gradient and applying the update rule on the weights, we compute ‖W‖ (the L2 norm
of the weights), and if it exceeds the user-defined threshold c, the weights are projected
onto the surface of the ball with radius c using

W = (W / ‖W‖) × c        (3.47)

One interesting property of max-norm regularization is that it prevents the neural
network from exploding. We previously saw that gradients may vanish in deep networks
during backpropagation, in which case the network does not learn properly. This
phenomenon is called the vanishing gradient problem. In contrast, gradients might be
greater than one in a deep neural network. In that case, the gradients become larger
as backpropagation moves toward the first layers, and the weights may suddenly explode
and become very large. This phenomenon is called the exploding gradient problem. In
addition, if the learning rate in the gradient descent algorithm is set to a high value, the
network may explode. Applying max-norm regularization on the weights prevents the
network from exploding since it always keeps the norm of the weights
below the threshold c.
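A minimal sketch of the projection in (3.47), assuming the weights of a layer are stored in a NumPy array; it would be applied right after every update of the gradient descent algorithm.

import numpy as np

def max_norm_project(W, c):
    # if ||W|| exceeds the threshold c, project W back onto the
    # surface of the ball of radius c: W = W / ||W|| * c
    norm = np.linalg.norm(W)
    if norm > c:
        W = W / norm * c
    return W

W = np.random.randn(100) * 5.0       # weights after an update step
W = max_norm_project(W, c=3.0)
print(np.linalg.norm(W))             # never larger than 3.0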

3.6.3.4 Dropout
Dropout (Hinton 2014) is another technique for regularizing a neural network and
preventing it from overfitting. For each neuron in the network, it generates a number
between 0 and 1 using the uniform distribution. If the generated number for a neuron is
less than p, the neuron is dropped from the network along with all its connections.
Then, the forward and backward passes are computed on the new network. This


Fig. 3.21 If dropout is activated on a layer, each neuron in the layer will be attached to a blocker.
The blocker blocks information flow in the forward pass as well as the backward pass (i.e.,
backpropagation) with probability p

process of dropping some neurons from the original network and computing the
forward and backward pass on the new network is repeated for every sample in the
training set.
In other words, for each sample in the training set, a subset of neurons from the
original network is selected to form a new network, and the forward pass is computed
using this smaller network. Likewise, the backward pass is computed on the smaller
network and its weights are updated. Then, the weights are copied back to the original
network. The above procedure seems complicated, but it can be implemented
efficiently. This is illustrated in Fig. 3.21.
First, we can define a new layer called dropout. This layer can be connected to any
other layer such as a convolution or fully connected layer. The number of elements
in this layer is equal to the number of outputs of the previous layer. There is only
one parameter in this layer, called the dropout ratio, which is denoted by p and
defined by the user while designing the network. This layer is shown using black
squares in the figure. For each element in this layer, a random number between 0 and 1
is generated using the uniform distribution. If the generated number for the ith element
is greater than p, it passes the output of the corresponding neuron from the previous
layer to the next layer. Otherwise, it blocks the output of the previous neuron and
sends 0 to the next layer. Since the blocker is activated for the ith neuron during the
forward pass, it also blocks the signals coming to this element during backpropagation
and passes 0 back to the neuron in the previous layer. This way, the gradient of that
neuron will be equal to zero. Consequently, the ith neuron has no effect on the forward
or backward pass, which is equivalent to dropping it out of the network.
At test time, if we execute the forward pass several times, we are likely to get different
outputs from the network. This is due to the fact that the dropout layer blocks the
signal going out of some of the neurons in the network. To get a stable output, we can
execute the forward pass many times on the same test sample and average the outputs.
For instance, we can run the forward pass 1000 times, obtain 1000 outputs for the same
test sample, and simply compute their average to obtain the final output.


However, this method is not practical since obtaining a result for one sample
requires running the network many times. The efficient way is to run the forward
pass only once but scale the outputs of the dropout gates at test time. To be more
specific, the dropout gates (black squares in the figure) act as scalers rather than
blockers. They simply take the output of the neuron and pass it to the next layer after
rescaling it by a factor β. Determining the value of β is simple.
Assume a single neuron attached to a dropout blocker, and assume that the output of
the neuron for a given input xi is z. Since there is no randomness in the neuron, it will
always return z for the input xi. However, when z passes through the dropout blocker,
it will be blocked with probability p. In other words, if we perform the forward pass
N times, we expect that z is passed by the blocker (1 − p) × N times and blocked
(0 is passed) p × N times. The average output of the dropout gate will therefore be
equal to ((1 − p) × N × z + p × N × 0)/N = (1 − p) × z.
Consequently, instead of running the network many times at test time, we can
simply set β = 1 − p and rescale the output of each neuron connected to the dropout
layer by this factor. In this case, the dropout gates act as scalers instead of blockers.10
Dropout is an effective way of regularizing neural networks, including ConvNets.
Commonly, dropout layers are placed after fully connected layers in a ConvNet, but
this is not a golden rule; one can even attach a dropout layer to the input in order
to generate noisy inputs! Also, the dropout ratio p is usually set to 0.5, but there is no
theoretical result telling us what the value of the dropout ratio should be. We can start
from p = 0.5 and adjust it using the development set.
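The following sketch (an illustration with assumed names, not the Caffe implementation) follows the scheme described above: during training each output is blocked with probability p, and at test time the outputs are rescaled by β = 1 − p. As footnote 10 explains, Caffe itself uses the inverted variant, which rescales by 1/(1 − p) during training so that nothing has to be changed at test time.

import numpy as np

def dropout_forward(x, p, train=True):
    if train:
        # block each element with probability p
        mask = (np.random.uniform(size=x.shape) >= p).astype(x.dtype)
        return x * mask
    # test time: pass everything but rescale by beta = 1 - p
    return x * (1.0 - p)

x = np.random.randn(4, 10)                       # activations of a fully connected layer
y_train = dropout_forward(x, p=0.5, train=True)  # roughly half of the outputs are zero
y_test = dropout_forward(x, p=0.5, train=False)  # scaled, deterministic output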

3.6.3.5 Mixed Regularization
We can incorporate several methods for regularizing a ConvNet. For example, using
both L2 regularization and dropout is common. In fact, you can combine all the
regularization methods we explained in this section and train your network.

3.6.4 Learning Rate Annealing
Stochastic gradient descent has a user-defined parameter called the learning rate, which
we denote by α. The training usually starts with an initial value for the learning
rate such as α = 0.001. The learning rate can be kept constant during the whole
training. Ideally, if the initial value of the learning rate is chosen properly, we
expect the loss function to decrease at each iteration.
In other words, the algorithm gets closer to a local minimum at each iteration.
Depending on the shape of the loss function in the high-dimensional space, the optimization

10 For interested readers: a more efficient way of implementing dropout is to rescale the signals by
a factor of 1/(1 − p) if they pass through dropout gates during training. Then, at test time, we can
simply remove the dropout layer from the network and compute the forward pass as we do in a
network without dropout layers. This technique is the one incorporated in the Caffe library.


Fig. 3.22 If the learning rate is kept fixed, the algorithm may jump over the local minimum (left).
Annealing the learning rate helps the optimization algorithm converge to a local minimum (right)

algorithm may fluctuate near a local minimum and fail to converge to it. One possible
cause of these fluctuations is the learning rate. The reason is that a gradient-based
method moves toward a local minimum based on the gradient of the loss function.
When the learning rate is kept constant and the current location is close to a local
minimum, the algorithm may jump over the local minimum after multiplying the
gradient by the learning rate and updating the location based on this value.
This problem is illustrated in Fig. 3.22. On the left, the learning rate is kept constant.
We see that the algorithm jumps over the local minimum and may or may not converge
to it in a finite number of iterations. In contrast, in the right plot, the learning rate is
reduced at each iteration. We see that the algorithm is able to converge to the local
minimum in a finite number of iterations.
In general, it is a good practice to reduce the learning rate over time. This can be
done in different ways. Denoting the initial learning rate by αinitial , the learning rate
at iteration t can be obtained by:
α_t = α_initial × γ^t        (3.48)

Fig. 3.23 Exponential learning rate annealing


Fig. 3.24 Inverse learning rate annealing

where γ ∈ [0, 1] is a user-defined value. Figure 3.23 shows the plot of this function
for different values of γ with α_initial = 0.001. If the value of γ is close to zero, the
learning rate will approach zero quickly. The value of γ is chosen based on the
maximum number of iterations of the optimization algorithm. For example, if the
maximum number of iterations is equal to 20,000, γ may take a value smaller than but
close to 0.9999. In general, we have to adjust γ such that the learning rate becomes
smaller in the last iterations. If the maximum number of iterations is equal to 20,000
and we set γ to 0.99, it is likely that the ConvNet will not learn because the learning
rate becomes almost zero after 1000 iterations. This learning rate annealing method
is known as exponential annealing.
The learning rate can also be reduced using
α_t = α_initial × (1 + γ × t)^(−β)        (3.49)

where γ and β are user-defined parameters. Figure 3.24 illustrates the plot of this
function for different values of γ and β = 0.99. This annealing method is known as
inverse annealing. Similar to exponential annealing, the parameters of the inverse
annealing method should be chosen such that the learning rate becomes smaller as it
approaches the maximum number of iterations.
The last annealing method which is commonly used in training neural networks
is called step annealing and it is given by
α_t = α_initial × γ^⌊t/d⌋        (3.50)

In the above equation, ⌊·⌋ denotes the integer division operator, and γ ∈ [0, 1] and
d ∈ Z+ are user-defined parameters. The intuition behind this method is that instead
of constantly reducing the learning rate, we multiply the learning rate by γ every d
iterations. Figure 3.25 shows the plot of this function for different values of
γ and d = 5000.


Fig. 3.25 Step learning rate annealing

In contrast to the other two methods, adjusting the parameters of step annealing is
straightforward. The step parameter d is usually set to the number of training samples
or a fraction/multiple of this number. For example, if there are 10,000 samples in the
training set, we may consider setting d to 5,000, meaning that the learning rate is
reduced every 5,000 samples. The amount of reduction can be chosen based on the
maximum number of iterations and the step size d. Also, in the case of mini-batch
gradient descent with batch size 50, setting d to 100 will reduce the learning rate
exactly every 5,000 samples.
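As a quick illustration of (3.48)–(3.50), the sketch below evaluates the three schedules at a few iterations; the parameter values are arbitrary examples chosen only for this illustration.

def exponential_annealing(alpha_init, gamma, t):
    # alpha_t = alpha_initial * gamma^t
    return alpha_init * (gamma ** t)

def inverse_annealing(alpha_init, gamma, beta, t):
    # alpha_t = alpha_initial * (1 + gamma * t)^(-beta)
    return alpha_init * (1.0 + gamma * t) ** (-beta)

def step_annealing(alpha_init, gamma, d, t):
    # alpha_t = alpha_initial * gamma^(t // d): the rate drops every d iterations
    return alpha_init * (gamma ** (t // d))

alpha0 = 0.001
for t in (0, 5000, 10000, 20000):
    print(t, exponential_annealing(alpha0, 0.9999, t),
          inverse_annealing(alpha0, 0.0001, 0.99, t),
          step_annealing(alpha0, 0.5, 5000, t))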

3.7 Analyzing Quantitative Results
Throughout this chapter, we discussed designing, implementing, and evaluating
ConvNets. So far, we have seen that the dataset is divided into three disjoint sets,
namely training, development, and test. Then, the idea is implemented, trained, and
evaluated. We also explained that assessing the idea is done using a single quantitative
number computed by a metric such as accuracy or the F1-score. Based on the results
on these three sets of data, we decide whether the model must be refined or is
satisfactory.

Table 3.1 Four different scenarios that may happen in practice

              Scenario 1 (%)  Scenario 2 (%)  Scenario 3 (%)  Scenario 4 (%)
Goal                99              99              99              99
Train               80              98              98              98
Development          –              80              97              97
Test                 –               –              80              97


Typically, we may encounter the four scenarios illustrated in Table 3.1. Assume that
our evaluation metric is accuracy. We are given a dataset and we have already split
the dataset into training, development, and test sets. The goal is to design a ConvNet
with 99% accuracy. Assume we have designed a ConvNet and trained it. Then, the
accuracy of the ConvNet on the training set is 80% (Scenario 1). Without even
assessing the accuracy on the development and test sets, we conclude that this idea
is not good for our purpose. The possible actions in this scenario are
• Train the model longer
• Make the model bigger (e.g., increase the number of filters in each layer or the
number of neurons in the fully connected layers)
• Design a new architecture
• Make the regularization coefficient λ smaller (closer to zero)
• Increase the threshold of norm in the max-norm constraint
• Reduce the dropout ratio
• Check the learning rate and learning rate annealing
• Plot the value of the loss function over all iterations to see whether the loss is
decreasing, fluctuating, or constant.
In other words, if we are sure that the ConvNet has been trained for a sufficient number
of iterations and the learning rate is correct, we can conclude that the current ConvNet
is not flexible enough to capture the nonlinearity of the data. This means that the
current model has high bias, so we have to increase the flexibility of our model.
Looking at the above solutions, we realize that most of them try to increase the
flexibility of the model.
We may apply the above solutions and increase the accuracy on the training set to
98%. However, when the model is evaluated on the development set, the accuracy is
80% (Scenario 2). This is mainly a high-variance problem, meaning that our model
might be too flexible and captures every detail of the training set. In other words,
it overfits on the training set. Possible actions in this scenario are
• Make the regularization coefficient λ bigger
• Reduce the threshold in the max-norm constraint
• Increase the dropout ratio
• Collect more data
• Synthesize new data for the training set (we will discuss this method in the next
chapters)
• Change the model architecture.
If we decide to change the model architecture, we have to keep in mind that
the new architecture must be less flexible (e.g., shallower, fewer neurons/filters) since
our current model is very flexible and overfits on the training set. After applying
these changes, we may find a model with 98 and 97% accuracy on the training and
development sets, respectively. But after evaluating the model on the test set, we
realize that its accuracy is 80% (Scenario 3).


At this point, one may consider changing the model architecture or tweaking the
model parameters in order to increase the accuracy on the test set as well. But this
approach is wrong, and a model trained this way may not work in the real world. The
reason is that we would then be adjusting our model on both the development set and
the test set. The main problem in Scenario 3 is that our model has overfit on the
development set. If we also try to adjust it on the test set, we cannot be sure whether
the high accuracy on the test set is because the model generalizes well or because it
has overfit on the test set. So, the best solution in this case is to collect more
development data. By collecting data we mean new and fresh data.
Scenario 4 is what we usually expect to achieve in practice. In this scenario,
we have adjusted our model on the development set, but it also produces good results
on the test set. In this case, we can be confident that our model is ready to be used in
the real world.
There are other serious issues about data, such as what happens if the distribution
of the test set is different from that of the training and development sets. Solutions for
addressing this problem are not within the scope of this book. Interested readers can
refer to textbooks on data science for more details.

3.8 Other Types of Layers
The ConvNets that we will design for detecting and classifying traffic signs are composed
of convolution, pooling, activation, and fully connected layers. However, other types
of layers have been proposed recently, and several works utilize these kinds of layers.
In this section, we will explain some of them.

3.8.1 Local Response Normalization
Local response normalization (LRN) (Krizhevsky et al. 2012) is a layer which is usually
placed immediately after the activation of a convolution layer. In the remainder of this
section, when we say a convolution layer we refer to the activation of the convolution
layer. Considering that the feature maps of a convolution layer have N channels of size
H × W, the LRN layer will produce new N-channel feature maps of size H × W
(exactly the same size as the feature maps of the convolution layer), where the element
b^i_{m,n} at location (m, n) in the ith channel is given by

b^i_{m,n} = a^i_{m,n} / ( k + α · Σ_{j=max(0, i−n/2)}^{min(N−1, i+n/2)} (a^j_{m,n})² )^β        (3.51)

In the above equation, a^i_{m,n} denotes the value of the feature map of the convolution
layer at spatial location (m, n) in the ith channel. Also, k, n, α and β are user-defined


parameters. Their default values are k = 2, n = 5, α = 10⁻⁴ and β = 0.75. The
LRN layer normalizes the activations at the same spatial location using neighboring
channels. This layer does not have any trainable parameters.
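A minimal NumPy sketch of (3.51), assuming the feature maps of the layer are stored as an N × H × W array and using the default hyperparameters mentioned above; this is only an illustration, not the Caffe implementation.

import numpy as np

def lrn(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a: feature maps of shape (N, H, W); returns maps of the same shape
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        s = np.sum(a[lo:hi + 1] ** 2, axis=0)   # squared activations of neighboring channels
        b[i] = a[i] / (k + alpha * s) ** beta
    return b

maps = np.random.randn(16, 8, 8)
print(lrn(maps).shape)    # (16, 8, 8), same size as the input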

3.8.2 Spatial Pyramid Pooling
Spatial pyramid pooling (He et al. 2014) was proposed to generate fixed-length feature
vectors for input images of arbitrary size. A spatial pyramid pooling layer is placed
just before the first fully connected layer. Instead of pooling a feature map with a
window of fixed size, it divides the feature map into a fixed number of regions and
pools all elements inside each region. Also, it does this at several scales. At the first
scale, it pools over the whole feature map. At the second scale, it divides each feature
map into four regions. At the third scale, it divides the feature map into 16 regions.
Then, it concatenates all these vectors and connects the result to the fully connected
layer.

3.8.3 Mixed Pooling
Basically, we put one pooling layer after a convolution layer. Lee et al. (2016) proposed
an approach called mixed pooling. The idea behind mixed pooling is to combine max
pooling and average pooling. Specifically, mixed pooling combines the outputs of a max
pooling and an average pooling as follows:

pool_mix = α · pool_max + (1 − α) · pool_avg        (3.52)

In the above equation, α ∈ [0, 1] is a trainable parameter which can be trained using
the standard backpropagation algorithm.
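As a simple sketch of (3.52), assuming a single pooling window stored in a NumPy array (only an illustration; in a real layer α would be updated by backpropagation):

import numpy as np

def mixed_pool(window, alpha):
    # pool_mix = alpha * pool_max + (1 - alpha) * pool_avg
    return alpha * window.max() + (1.0 - alpha) * window.mean()

window = np.array([[1.0, 3.0],
                   [2.0, 4.0]])
print(mixed_pool(window, alpha=0.7))   # 0.7 * 4.0 + 0.3 * 2.5 = 3.55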

3.8.4 Batch Normalization
The distribution of activations in each layer of a ConvNet changes during training, and
it varies from one layer to another. This reduces the convergence speed of the
optimization algorithm. Batch normalization (Ioffe and Szegedy 2015) is a technique
to overcome this problem. Denoting the input of a batch normalization layer by x and
its output by z, batch normalization applies the following transformation to x:

z = (x − μ) / √(σ² + ε) · γ + β.        (3.53)

Basically, it applies the mean-variance normalization on the input x using μ and
σ and linearly scales and shifts it using γ and β. The normalization parameters μ
and σ are computed for the current layer over the training set using a method called


exponential moving average. In other words, they are not trainable parameters. In
contrast, γ and β are trainable parameters.
At test time, the μ and σ that were computed over the training set are used for the
forward pass and remain unchanged. The batch normalization layer is usually placed
between a fully connected/convolution layer and its activation function.
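A minimal sketch of the transformation in (3.53) at test time, assuming μ and σ² have already been estimated over the training set and γ, β are the learned scale and shift (an illustration only):

import numpy as np

def batch_norm_inference(x, mu, var, gamma, beta, eps=1e-5):
    # z = (x - mu) / sqrt(var + eps) * gamma + beta
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

x = np.random.randn(30, 6)              # a mini-batch with 6 features/channels
mu, var = np.zeros(6), np.ones(6)       # statistics estimated during training
gamma, beta = np.ones(6), np.zeros(6)   # trainable scale and shift
print(batch_norm_inference(x, mu, var, gamma, beta).shape)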

3.9 Summary
Understanding the underlying processes in a convolutional neural network is crucial
for developing reliable architectures. In this chapter, we explained how convolution
operations are derived from fully connected layers. For this purpose, the weight-sharing
mechanism of convolutional neural networks was discussed. The next basic building
block in convolutional neural networks is the pooling layer. We saw that pooling layers
are intelligent ways to reduce the dimensionality of feature maps. To this end, max
pooling, average pooling, or mixed pooling is applied on feature maps with a stride
bigger than one.
In order to explain how to design a neural network, two classical network architectures
were illustrated and explained. Then, we formulated the problem of designing a
network in three stages, namely idea, implementation, and evaluation. All these stages
were discussed in detail. Specifically, we reviewed some of the libraries that are
commonly used for training deep networks. In addition, common metrics (i.e.,
classification accuracy, confusion matrix, precision, recall, and F1-score) for evaluating
classification models were mentioned together with their advantages and disadvantages.
Two important steps in training a neural network successfully are initializing its
weights and regularizing the network. Three commonly used methods for initializing
weights were introduced. Among them, Xavier initialization and its successors were
discussed thoroughly. Moreover, regularization techniques such as L1 , L2 , max-norm,
and dropout were discussed. Finally, we finished this chapter by explaining more
advanced layers that are used in designing neural networks.

3.10 Exercises
3.1 How can we compute the gradient of convolution layer when the convolution
stride is greater than 1?
3.2 Compute the gradient of max pooling with overlapping regions.


3.3 How much memory is required by LeNet-5 to forward an image and keep the
information of all layers?
3.4 Show that the number of parameters of AlexNet is equal to 60,965,224.
3.5 Assume that there are 500 different classes in X′ and there are 100 samples in
each class, yielding 50,000 samples in X′. In which situations is the accuracy score
a reliable metric for assessing the model? In which situations might the accuracy score
be very close to 1 while the model is not practically accurate?
3.6 Consider the trivial example where precision is equal to 0 and recall is equal to
1. Show why computing the harmonic mean is preferable to simple averaging.
3.7 Plot the logistic loss function and L2 regularized logistic loss function with
different values for λ and compare the results. Repeat the procedure using L1 regularization and elastic nets.

References
Aghdam HH, Heravi EJ, Puig D (2016) Computer vision ECCV 2016 workshops, vol 9913, pp
178–191. doi:10.1007/978-3-319-46604-0
Coates A, Ng AY (2012) Learning feature representations with K-means. Lecture notes in computer
science (lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 7700.
LECTU, pp 561–580. doi:10.1007/978-3-642-35289-8-30
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical
image database. In: IEEE conference on computer vision and pattern recognition, pp 2–9. doi:10.
1109/CVPR.2009.5206848
Dong C, Loy CC, He K (2014) Image super-resolution using deep convolutional networks, vol
8828(c), pp 1–14. doi:10.1109/TPAMI.2015.2439281, arXiv:1501.00092
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the 13th international conference on artificial intelligence and statistics
(AISTATS), vol 9, pp 249–256. doi:10.1.1.207.2059. http://machinelearning.wustl.edu/mlpapers/
paper_files/AISTATS2010_GlorotB10.pdf
He K, Zhang X, Ren S, Sun J (2014) Spatial pyramid pooling in deep convolutional networks for
visual recognition, cs.CV, pp 346–361. doi:10.1109/TPAMI.2015.2389824, arXiv:abs/1406.4729
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. arXiv:1502.01852
Hinton G (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn
Res (JMLR) 15:1929–1958
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing
internal covariate shift. In: Proceedings of the 32nd international conference on machine learning
(ICML), Lille, pp 448–456. doi:10.1007/s13398-014-0173-7.2, http://www.JMLR.org
Krizhevsky A, Sutskever I, Hinton G (2012) Imagenet classification with deep convolutional neural
networks. In: Advances in neural information processing systems. Curran Associates, Inc., pp
1097–1105


LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document
recognition. Proc IEEE 86(11):2278–2323. doi:10.1109/5.726791, arXiv:1102.0183
Lee CY, Gallagher PW, Tu Z (2016) Generalizing pooling functions in convolutional neural networks: mixed, gated, and tree. Aistats 51. arXiv:1509.08985
Mishkin D, Matas J (2015) All you need is a good init. In: ICLR, pp 1–8. doi:10.1016/08981221(96)87329-9, arXiv:1511.06422
Scherer D, Müller A, Behnke S (2010) Evaluation of pooling operations in convolutional architectures for object recognition. In: International conference on artificial neural networks, vol 6354.
LNCS, pp 92–101
Schmidhuber J (2015) Deep Learning in neural networks: an overview. Neural Networks 61:85–117.
doi:10.1016/j.neunet.2014.09.003, arXiv:1404.7828
Springenberg JT, Dosovitskiy A, Brox T, Riedmiller M (2015) Striving for simplicity: the all convolutional net. In: ICLR-2015 workshop track, pp 1–14. arXiv:1412.6806
Sutskever I, Martens J, Dahl G, Hinton G (2013) On the importance of initialization and momentum
in deep learning. JMLR W&CP 28(2010):1139–1147. doi:10.1109/ICASSP.2013.6639346
Zeiler MD, Fergus R (2013) Stochastic pooling for regularization of deep convolutional neural
networks. In: ICLR, pp 1–9. arXiv:1301.3557

4 Caffe Library

4.1 Introduction
Implementing ConvNets from scratch is a tedious task. In particular, implementing the
backpropagation algorithm correctly requires calculating the gradient of each layer
correctly. Even after implementing the backward pass, it has to be validated by
computing the gradient numerically and comparing it with the result of backpropagation.
This is called a gradient check. Moreover, efficiently implementing each layer on a
GPU is another difficult task. For these reasons, it might be more practical to use a
library for this purpose.
As we discussed in the previous chapter, there are many libraries and frameworks
that can be used for training ConvNets. Among them, there is one library which is
suitable for development as well as applied research. This library is called Caffe.1
Figure 4.1 illustrates the structure of Caffe.
The Caffe library is developed in C++ and utilizes the CUDA library for performing
computations on the GPU.2 There is a library developed by NVIDIA called cuDNN
that implements common layers found in ConvNets as well as their gradients. Using
cuDNN, it is possible to design and train ConvNets which are executed only on GPUs.
Caffe makes use of cuDNN for implementing some of the layers on the GPU. It has
also implemented some other layers directly using CUDA. Finally, besides providing
interfaces for the Python and MATLAB programming languages, it also provides a
command-line tool that can be used for training and testing ConvNets.
One beauty of Caffe is that designing and training a network can be done by
employing text files which are later parsed using the Protocol Buffers library. But you
are not limited to designing and training using only text files. It is possible to also design and

1 http://caffe.berkeleyvision.org.
2 There are some branches of Caffe that use OpenCL for communicating with the GPU.


Fig. 4.1 The Caffe library uses different third-party libraries and it provides interfaces for C++,
Python, and MATLAB programming languages

train ConvNets by writing a computer program in C++, Python or MATLAB. However,
a detailed analysis of ConvNets has to be done by writing computer programs or special
software.
In this chapter, we will first explain how to use text files and the command-line tools
for designing and training ConvNets. Then, we will explain how to do it in Python.
Finally, methods for analyzing ConvNets using Python will also be discussed.

4.2 Installing Caffe
Installation of Caffe requires installing CUDA and some third-party libraries on your
system. The list of required libraries can be found at caffe.berkeleyvision.org. If you
are using Ubuntu, the Synaptic Package Manager can be utilized for installing these
libraries. Next, the CUDA drivers must be installed on the system. Try to download the
latest CUDA driver compatible with Caffe from the NVIDIA website. Installing the
CUDA drivers can be as simple as just running the installation file. In the worst case,
it may take some time to figure out what the error messages mean and to finally install
them successfully.
After that, the cuDNN library must be downloaded and copied into the CUDA folder,
which is by default located at /usr/local/cuda. You must copy cudnn*.h into the
include folder and libcudnn*.so* into the lib/lib64 folder. Finally, you must follow the
instructions provided on Caffe's website for installing the Caffe library.

4.3 Designing Using Text Files
A ConvNet and its training procedure can be defined using two text files. The first text
file defines the architecture of the neural network (including ConvNets) and the second
file defines the optimization algorithm as well as its parameters. These text files are


usually stored with the .prototxt extension. This extension shows that the text inside
these files follows the syntax defined by the Protocol Buffers (protobuf) protocol.3
A protobuf is composed of messages, where each message can be interpreted as a
struct in a programming language such as C++. For example, the following protobuf
contains two messages, namely Person and Group.
message Person {
required string name = 1;
optional int32 age = 2;
repeated string email = 3;
}


message Group {
required string name = 1;
repeated Person member = 3;
}


Listing 4.1 A protobuf with two messages.

The field rule required shows that specifying this field in the text file is mandatory.
In contrast, the rule optional shows that specifying this field in the text file is optional.
Similarly, the rule repeated states that this field can be repeated zero or more times in
the text file. Finally, the numbers after the equal signs are unique tag numbers which
are assigned to each field in a message. Each number has to be unique inside the
message.
From a programming perspective, these two messages depict two data structures,
namely Person and Group. The Person struct is defined using three fields including
one required, one optional and one repeated (array) field. The Group struct is also
defined using one required field and one repeated field, where each element in this
field is an instance of Person.
You can write the above definition in a text editor and save it with .proto extension
(e.g. sample.proto). Then, you can open the terminal in Ubuntu and execute the
following command:
protoc -I=SRC_DIR --python_out=DST_DIR SRC_DIR/sample.proto

If the command is executed successfully, you should find a file named sample_pb2.py
in the directory DST_DIR. Instantiating Group can now be done in a programming
language. To this end, you should import sample_pb2 into the Python environment and
run the following code:
import sample_pb2

g = sample_pb2.Group()
g.name = 'group 1'

m = g.member.add()
m.name = 'Ryan'
m.age = 20
m.email.append('mail1@sample.com')
m.email.append('mail1@sample.com')

m = g.member.add()
m.name = 'Harold'
m.age = 23

3 Implementations of the methods in this chapter are available at github.com/pcnn/ .


Using the above code, we create a group called "group 1" with two members. The
age of the first member is 20, his name is "Ryan" and he has two email addresses.
The name of the second member is "Harold"; he is 23 years old and does not have
any email address.
The appealing property of protobuf is that you can instantiate the Group structure
using a plain text file. The following plain text is exactly equivalent to the above
Python code:
name: "group 1"
member {
name: "member1"
age: 20
email : "mail1@sample.com"
email : "mail1@sample.com"
}
member {
name: "member2"
age: 23
}


This method has some advantages over instantiating using a program. First, it is
independent of the programming language. Second, its readability is higher. Third, it
can be easily edited. Fourth, it is more compact. However, there might be cases where
instantiating is much faster when we write a computer program rather than a plain
text file.
There is a file called caffe.proto inside the source code of the Caffe library which
defines several protobuf messages.4 We will use this file for designing a neural
network. In fact, caffe.proto is the reference file that you must always refer to when
you have a doubt about your text file. Also, it is constantly updated by the developers
of the library. Hence, it is a good idea to keep studying the changes in newer versions
so you will have a deeper knowledge of what can be implemented using the Caffe
library. There is a message in caffe.proto called "NetParameter" and it is currently
defined as follows5:
message NetParameter {
optional string name = 1;
optional bool force_backward = 5 [ default = false ] ;
optional NetState state = 6;
optional bool debug_info = 7 [ default = false ] ;
repeated LayerParameter layer = 100;
}


We have excluded the fields marked as deprecated in the current version from the above
message. The architecture of a neural network is defined using this message. It
contains a few fields with basic data types (e.g., string, int32, bool). It also has one
field of type NetState and an array (repeated) of LayerParameter. Arguably, one
can learn Caffe just by thoroughly studying NetParameter. The reason is illustrated in
Fig. 4.2.

4 All the explanations for the Caffe library in this chapter are valid for the commit number 5a201dd.
5 This definition may change in next versions.


Fig. 4.2 The NetParameter is indirectly connected to many other messages in the Caffe library

It is clear from the figure that NetParameter is indirectly connected to different
kinds of layers through LayerParameter. In effect, NetParameter is a container that
holds layers. Also, there are several other kinds of layers in the Caffe library that we
have not included in the figure. The message LayerParameter has many fields. Among
them, the following are the fields that we may need for the purpose of this book:
message LayerParameter {
optional string name = 1;
optional string type = 2;
repeated string bottom = 3;
repeated string top = 4;

optional ImageDataParameter image_data_param = 115;
optional TransformationParameter transform_param = 100;

optional AccuracyParameter accuracy_param = 102;
optional ConvolutionParameter convolution_param = 106;
optional CropParameter crop_param = 144;
optional DropoutParameter dropout_param = 108;
optional ELUParameter elu_param = 140;
optional InnerProductParameter inner_product_param = 117;
optional LRNParameter lrn_param = 118;
optional PoolingParameter pooling_param = 121;
optional PReLUParameter prelu_param = 131;
optional ReLUParameter relu_param = 123;
optional ReshapeParameter reshape_param = 133;
optional SigmoidParameter sigmoid_param = 124;
optional SoftmaxParameter softmax_param = 125;
optional TanHParameter tanh_param = 127;

optional HingeLossParameter hinge_loss_param = 114;

repeated ParamSpec param = 6;
optional LossParameter loss_param = 101;

optional Phase phase = 10;
}


Fig. 4.3 A computational graph (neural network) with three layers

Each layer has a name. Although entering a name for a layer is optional, it is highly
recommended to give each layer a unique name. This increases the readability of your
model. It also has another function. Assume you want to have two convolution layers
with exactly the same parameters; in other words, these two convolution layers share
the same set of weights. This can be easily specified in Caffe by giving these two
layers an identical name.
The string field "type" specifies the type of the layer. For example, by assigning
"Convolution" to this field, we tell Caffe that the current layer is a convolution layer.
Note that the type of a layer is case-sensitive. This means that assigning "convolution"
(lowercase c instead of capital C) to type will raise an error telling us that
"convolution" is not a valid type for a layer.
There are two arrays of strings in LayerParameter called “top” and “bottom”. If
we assume that a layer (an instance of LayerParameter) is represented by a node
in computational graphs, the bottom variable shows the tag of incoming nodes to
the current node and the top variable shows the tag of outgoing edges. Figure 4.3
illustrates a computational graph with three nodes.
This computational graph is composed of three layers, namely data, conv1 and
crop1. For now, assume that the node data reads images along with their labels
from a disk and stores them in memory. Apparently, the node data does not get its
information from another node. For this reason, it does not have any bottom (the
length of bottom is zero). The node data passes this information to other nodes in
the graph. In Caffe, the information produced by a node is recognized by unique
tags. The variable top stores the name of these tags. A tag and name of a node could
be identical. As we can see in node data, it produces only one output. Hence, the
length of array top will be equal to 1. The first (and only) element in this array shows
the tag of the first output of the node. In the case of data, the tag has been also called
data. Now, any other node can have access to information produced by the node
data using its tag.
The second node is a convolution node named conv1 . This node receives information from node data. The convolution node in this example has only one incoming


node. Therefore, the length of the bottom array for conv1 will be 1. The first (and only)
element in this array refers to the tag from which the information will come to conv1.
In this example, the information comes from data. After convolving bottom[0] with the
filters of conv1 (the values of the filters are stored in the node itself), it produces only
one output. So, the length of the array top for conv1 will be equal to 1. The tag of the
output of conv1 has been called c1. In this case, the name of the node and the top of
the node are not identical.
Finally, the node crop1 receives two inputs: one from conv1 and one from data.
For this reason, the bottom array of this node has two elements. The first element is
connected to data and the second element is connected to c1. Then, crop1 crops the
first element of bottom (bottom[0]) to make its size identical to the second element of
bottom (bottom[1]). This node also generates a single output. The tag of this output
is crp1.
In general, passing information between computational nodes is done using the array
of bottoms (incoming) and the array of tops (outgoing). Each node stores information
about its bottoms and tops as well as its parameters and hyperparameters. There are
many other fields in LayerParameter, all ending with the phrase "Parameter". Based on
the type of a node, we may need to instantiate some of these fields.

4.3.1 Providing Data
The first thing to put in a neural network is at least one layer that provides data for the
network. There are a few ways in Caffe to do this. The simplest approach is to provide
data using a layer with type=”ImageData”. This type of layer requires instantiating
the field image_data_param from LayerParameter. ImageDataParameter is also a
message with the following definition:
message ImageDataParameter {
optional string source = 1;

optional uint32 batch_size = 4 [default = 1];
optional bool shuffle = 8 [default = false];

optional uint32 new_height = 9 [default = 0];
optional uint32 new_width = 10 [default = 0];

optional bool is_color = 11 [default = true];
optional string root_folder = 12 [default = ""];
}

Again, deprecated fields have been removed from this list. This message is composed
of fields with basic data types. An ImageData layer needs a text file with the following
structure:
ABSOLUTE_PATH_OF_IMAGE1 LABEL1
ABSOLUTE_PATH_OF_IMAGE2 LABEL2
...
ABSOLUTE_PATH_OF_IMAGEN LABELN

Listing 4.2 Structure of train.txt



An ImageData layer assumes that images are stored on disk using a regular image
format such as jpg, bmp, ppm, png, etc. Images could be stored in different locations
and on different disks of your system. In the above structure, there is one line for each
image in the training set. Each line is composed of two parts separated by a space
character (ASCII code 32). The left part shows the absolute path of the image and
the right part shows the class label of that image.
The current implementation of Caffe separates the class label from the image path
using the space character in the line. Consequently, if the path of the image contains
space characters, Caffe will not be able to decode the line and may raise an exception.
For this reason, avoid space characters in the names of folders and files when you are
creating a text file for an ImageData layer.
Moreover, class labels have to be integer numbers and they always have to start
from zero. That said, if there are 20 classes in your dataset, the class labels have to
be integer numbers between 0 and 19 (19 included). Otherwise, Caffe may raise an
exception during training. For example, the following sample shows a small part of
a text file that is prepared for an ImageData layer.
/home/pc/Desktop/GTSRB/Training_CNN/00019/00000_00006.ppm 19
/home/pc/Desktop/GTSRB/Training_CNN/00029/00003_00021.ppm 29
/home/pc/Desktop/GTSRB/Training_CNN/00010/00054_00008.ppm 10
/home/pc/Desktop/GTSRB/Training_CNN/00023/00010_00027.ppm 23
/home/pc/Desktop/GTSRB/Training_CNN/00033/00022_00008.ppm 33
/home/pc/Desktop/GTSRB/Training_CNN/00021/00000_00005.ppm 21
/home/pc/Desktop/GTSRB/Training_CNN/00005/00020_00022.ppm 5
/home/pc/Desktop/GTSRB/Training_CNN/00025/00026_00018.ppm 25
...


Suppose that our dataset contains 3,000,000 images and they are all located in a
common folder. In the above sample, all files are stored under
/home/pc/Desktop/GTSRB/Training_CNN. However, this common address is repeated
in the text file 3 million times since we have provided the absolute path of the images.
Taking into account the fact that Caffe loads all the paths and their labels into memory
at once, this means that 3,000,000 × 35 characters are repeated in memory, which
amounts to about 100 MB of memory. If the common path is longer or the number of
samples is higher, more memory will be needed to store this information.
To use memory more efficiently, ImageDataParameter provides a field called
root_folder. This field points to the path of the common folder in the text file. In the
above example, this will be equal to /home/pc/Desktop/GTSRB/Training_CNN.
In that case, we can remove the common path from the text file as follows:
/00019/00000_00006.ppm 19
/00029/00003_00021.ppm 29
/00010/00054_00008.ppm 10
/00023/00010_00027.ppm 23
/00033/00022_00008.ppm 33
/00021/00000_00005.ppm 21
/00005/00020_00022.ppm 5
/00025/00026_00018.ppm 25
...


Caffe will always add the root_folder to the beginning of the path in each line. This
way, redundant information is not stored in memory.
The variable batch_size denotes the size of the mini-batch to be forwarded and
backpropagated through the network. Common values for this parameter vary between
20 and 256; this also depends on the available memory on your GPU. The Boolean
variable shuffle shows whether or not Caffe must shuffle the list of files in each epoch.
Shuffling can be useful for having diverse mini-batches at each epoch. Considering
the fact that one epoch refers to processing the whole dataset, the list of files is shuffled
when the last mini-batch of the dataset is processed. In general, setting shuffle to true
is a good practice. In particular, setting this value to true is essential when the text file
containing the training samples is ordered based on the class label; in this case,
shuffling is an essential step in order to have diverse mini-batches. Also, as their names
suggest, if new_height and new_width have a value greater than zero, the loaded image
will be resized based on the values of these parameters. Finally, the variable is_color
tells Caffe to load images in color or grayscale format.
Now, we can define a network containing only an ImageData layer using the
protobuf grammar. This is illustrated below.
name: "net1"
layer{
name: "data"
type : "ImageData"
top : "data"
top : "label"
image_data_param{
source : " /home/pc/Desktop/ train . txt "
batch_size:30
root_folder : " /home/pc/Desktop/ "
is_color : true
shuffle : true
new_width:32
new_height:32
}
}


In Caffe, a tensor is a mini-batch × channels × height × width array. Note that an
ImageData layer produces two tops. In other words, the length of the top array for
this layer is 2. The first element of the top array stores the loaded images. Therefore,
the first top of the above layer will be a 30 × 3 × 32 × 32 tensor. The second element
of the top array stores the label of each image in the first top, and it will be an array
with mini-batch integer elements. Here, it will be a 30-element array of integers.

4.3.2 Convolution Layers
Now, we want to add a convolution layer to the network and connect it to the ImageData layer. To this end, we must create a layer with type=”Convolution” and then
configure the layer by instantiating convolution_param. The type of this variable is
ConvolutionParameter which is defined as follows:


message ConvolutionParameter {
optional uint32 num_output = 1;
optional bool bias_term = 2 [default = true];

repeated uint32 pad = 3;
repeated uint32 kernel_size = 4;
repeated uint32 stride = 6;

optional FillerParameter weight_filler = 7;
optional FillerParameter bias_filler = 8;
}

The variable num_output determines the number of convolution filters. Recall
from the previous chapter that the activation of a neuron is basically given by applying
the activation function to wx + bias. The variable bias_term states whether or not the
bias term must be considered in the neuron computation. The variable pad denotes the
zero-padding size and it is 0 by default. Zero padding is used to handle the borders
during convolution. Zero-padding an H × W image with pad = 2 can be thought of as
creating a zero matrix of size (H + 2·pad) × (W + 2·pad) and copying the image into
this matrix such that it is placed exactly in the middle of the zero matrix. Then, if the
size of the convolution filters is (2·pad + 1) × (2·pad + 1), the result of convolving
with the zero-padded image will be an H × W image, which is exactly the size of the
input image. Padding is usually done to keep the size of the input and output of
convolution operations constant. But it is commonly set to zero.
As the name suggests, the variable kernel_size determines the spatial size (width and
height) of the convolution filters. It should be noted that a convolution layer must have
the same number of bottoms and tops. It convolves each bottom separately with the
filters and passes the result to the corresponding top. The third dimension of the filters
is automatically computed by Caffe based on the number of channels coming from
the bottom node. Finally, the variable stride determines the stride of the convolution
operation and it is set to 1 by default. Now, we can update the protobuf text and add
a convolution layer to the network.
name: "net1"
layer{
name: "data"
type : "ImageData"
top : "data"
top : "label"
image_data_param{
source : " /home/hamed/Desktop/ train . txt "
batch_size:30
root_folder : " /home/hamed/Desktop/ "
is_color : true
shuffle : true
new_width:32
new_height:32
}
}
layer{
name: "conv1"
type : "Convolution"
bottom: "data"
top : "conv1"
convolution_param{
num_output: 6
kernel_size :5
}
}



The convolution layer has six filters of size 5 × 5 and it is connected to a data
layer that produces mini-batches of images. Figure 4.4 illustrates the diagram of the
neural network created by the above protobuf text.

4.3.3 Initializing Parameters
Any layer with trainable parameters including convolution layers has to be initialized
before training. Concretely, convolution filters (weights) and biases of convolution
layer have to be initialized. As we explained in the previous chapter, this can be
done by setting each weight/bias to a random number. However, generating random number can be done using different distributions and different methods. The
weight_filler and bias_filler parameters in LayerParameter specify the type of initialization method. They are both instances of FillerParameter which are defined as
follows:
message FillerParameter {
optional string type = 1 [default = 'constant'];
optional float value = 2 [default = 0];
optional float min = 3 [default = 0];
optional float max = 4 [default = 1];
optional float mean = 5 [default = 0];
optional float std = 6 [default = 1];

enum VarianceNorm {
FAN_IN = 0;
FAN_OUT = 1;
AVERAGE = 2;
}
optional VarianceNorm variance_norm = 8 [default = FAN_IN];
}

The string variable type defines the method that will be used for generating numbers.
Different values can be assigned to this variable. Among them, "constant", "gaussian",
"uniform", "xavier" and "msra" are commonly used in classification networks.
Concretely, a "constant" filler sets the parameters to the constant value specified by the
floating-point variable value.
Also, a "gaussian" filler assigns random numbers generated by a Gaussian distribution
specified by the mean and std variables. Likewise, a "uniform" filler assigns random

Fig. 4.4 Architecture of the network designed by the protobuf text. Dark rectangles show nodes.
Octagon illustrates the name of the top element. The number of outgoing arrows in a node is equal
to the length of top array of the node. Similarly, the number of incoming arrows to a node shows
the length of bottom array of the node. The ellipses show the tops that are not connected to another
node


numbers generated by the uniform distribution within a range determined by the min and
max variables.
The "xavier" filler generates uniformly distributed random numbers within
[−√(3/n), √(3/n)], where, depending on the value of the variance_norm variable, n could
be the number of inputs (FAN_IN), the number of outputs (FAN_OUT), or the average
of the two. The "msra" filler is like the "xavier" filler. The difference is that it generates
Gaussian-distributed random numbers with standard deviation equal to √(2/n).
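For illustration only (this is not how Caffe implements its fillers internally), the following NumPy sketch draws weights the way the "xavier" and "msra" fillers are described above, taking n as the number of inputs (FAN_IN):

import numpy as np

def xavier_fill(shape, n):
    # uniform in [-sqrt(3/n), sqrt(3/n)]
    limit = np.sqrt(3.0 / n)
    return np.random.uniform(-limit, limit, size=shape)

def msra_fill(shape, n):
    # Gaussian with standard deviation sqrt(2/n)
    return np.random.normal(0.0, np.sqrt(2.0 / n), size=shape)

# e.g., 6 filters of size 3 x 5 x 5, so each neuron has n = 3 * 5 * 5 = 75 inputs
W = xavier_fill((6, 3, 5, 5), n=75)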
As we mentioned in the previous chapter, filters are usually initialized using the
"xavier" or "msra" methods and biases are initialized with the constant value zero.
Now, we can also define weight and bias initializer for the convolution layer. The
updated protobuf text will be:
name: "net1"
layer{
name: "data"
type : "ImageData"
top : "data"
top : "label"
image_data_param{
source : " /home/hamed/Desktop/ train . txt "
batch_size:30
root_folder : " /home/hamed/Desktop/ "
is_color : true
shuffle : true
new_width:32
new_height:32
}
}
layer{
name: "conv1"
type : "Convolution"
bottom: "data"
top : "conv1"
convolution_param{
num_output: 6
kernel_size :5
weight_filler{
type : "xavier"
}
bias_filler{
type : "constant"
value:0
}
}
}


4.3.4 Activation Layer
Each output of a convolution layer is given by wx + b. Next, these values must be
passed through a nonlinear activation function. In the Caffe library, ReLU, Leaky
ReLU, PReLU, ELU, sigmoid, and hyperbolic tangent activations are implemented.
Setting type=”ReLU” will create a Leaky ReLU activation. If we set the leak value
to zero, this is equivalent to the ReLU activation. The other activations are created by
setting type=”PReLU”, type=”ELU”, type=”Sigmoid” and type=”TanH”. Then,


depending on the type of activation function, we can also adjust their hyperparameters. The messages for these activations are defined as follows:
message ELUParameter {
optional float alpha = 1 [default = 1];
}
message ReLUParameter {
optional float negative_slope = 1 [default = 0];
}
message PReLUParameter {
optional FillerParameter filler = 1;
optional bool channel_shared = 2 [default = false];
}

Clearly, the sigmoid and hyperbolic tangent activations do not have parameters to set.
However, as mentioned in (2.93) and (2.96), the ReLU family of activations in Caffe
has hyperparameters that should be configured. In the case of the Leaky ReLU and
ELU activations, we have to determine the value of α in (2.93) and (2.96). In Caffe,
α for the Leaky ReLU is specified by the negative_slope variable. In the case of the
PReLU activation, we have to tell Caffe how to initialize the α parameter using the
filler variable. Also, the Boolean variable channel_shared determines whether Caffe
should share the same α for all activations in the same layer (channel_shared=true) or
find a separate α for each channel of the layer. We can add this activation to the
protobuf as follows:
name: "net1"
layer{
name: "data"
type : "ImageData"
top : "data"
top : "label"
image_data_param{
source : " /home/hamed/Desktop/ train . txt "
batch_size:30
root_folder : " /home/hamed/Desktop/ "
is_color : true
shuffle : true
new_width:32
new_height:32
}
}
layer{
name: "conv1"
type : "Convolution"
bottom: "data"
top : "conv1"
convolution_param{
num_output: 6
kernel_size :5
weight_filler{
type : "xavier"
}
bias_filler{
type : "constant"
value:0
}
}
}
layer{
type : "ReLU"
bottom: "conv1"
top : "relu_c1"
}



Fig. 4.5 Diagram of the network after adding a ReLU activation

After adding this layer to the network, the architecture will look like Fig. 4.5.

4.3.5 Pooling Layer
A pooling layer is created by setting type=”Pooling”. Similar to a convolution layer,
a pooling layer must have the same number of bottoms and tops. It applies pooling
on each bottom separately and passes it to the corresponding top. Parameters of the
pooling operation are also determined using an instance of PoolingParameter.
message PoolingParameter {
enum PoolMethod {
MAX = 0;
AVE = 1;
STOCHASTIC = 2;
}
optional PoolMethod pool = 1 [default = MAX];
optional uint32 pad = 4 [default = 0];
optional uint32 kernel_size = 2;
optional uint32 stride = 3 [default = 1];
optional bool global_pooling = 12 [default = false];
}

Similar to ConvolutionParameter, the variables pad, kernel_size, and stride determine
the amount of zero padding, the size of the pooling window, and the stride of pooling,
respectively. The variable pool determines the type of pooling. Currently, Caffe supports
max pooling, average pooling, and stochastic pooling. However, we often choose
max pooling and it is the default option in Caffe. Setting global_pooling to true pools
over the entire spatial region of the bottom blob; it is equivalent to setting kernel_size
to the spatial size of the bottom blob. We add a max-pooling layer to our network.
The resulting protobuf will be:
name: "net1"
layer{
name: "data"
type : "ImageData"
top : "data"
top : "label"
image_data_param{
source : " /home/hamed/Desktop/ train . txt "
batch_size:30
root_folder : " /home/hamed/Desktop/ "
is_color : true
shuffle : true
new_width:32
new_height:32

}
}
layer{
name: "conv1"
type : "Convolution"
bottom: "data"
top : "conv1"
convolution_param{
num_output: 6
kernel_size :5
weight_filler{
type : "xavier"
}
bias_filler{
type : "constant"
value:0
}
}
}
layer{
name: "relu_c1"
type : "ReLU"
bottom: "conv1"
top : "relu_c1"
relu_param{
negative_slope:0.01
}
}
layer{
name: "pool1"
type : "Pooling"
bottom: "relu_c1"
top : "pool1"
pooling_param{
kernel_size :2
stride :2
}
}

The pooling will be done over 2 × 2 regions with stride 2. This will halve the
spatial size of the input. Figure 4.6 shows the diagram of the network.
We added another convolution layer with 16 filters of size 5 × 5, a ReLU activation,
and a max-pooling layer with 2 × 2 regions and stride 2 to the network. Figure 4.7
illustrates the diagram of the network.
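The spatial sizes after each of these layers can be traced with the usual output-size formula out = floor((in + 2·pad − kernel)/stride) + 1. The following short Python sketch (our own illustration, not part of Caffe) applies it to the layers defined above; note that Caffe's pooling layer rounds this value up rather than down, which makes no difference for the even-sized inputs used here:

def out_size(in_size, kernel, stride=1, pad=0):
    # Output spatial size: floor((in + 2*pad - kernel)/stride) + 1.
    return (in_size + 2 * pad - kernel) // stride + 1

s = 32                     # input images are resized to 32 x 32
s = out_size(s, 5)         # conv1: 5 x 5 kernels, no padding -> 28
s = out_size(s, 2, 2)      # pool1: 2 x 2, stride 2           -> 14
s = out_size(s, 5)         # conv2: 5 x 5 kernels, no padding -> 10
s = out_size(s, 2, 2)      # pool2: 2 x 2, stride 2           -> 5
print(s)                   # prints 5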

4.3.6 Fully Connected Layer
A fully connected layer is defined by setting type="InnerProduct" in the definition of
the layer. The number of bottoms and tops must be equal in this type of layer. It computes
the top for each bottom separately using the same set of parameters.

Fig. 4.6 Architecture of network after adding a pooling layer


Fig. 4.7 Architecture of the network after adding the second convolution, ReLU, and pooling layers

Hyperparameters of a fully connected layer are specified using an instance of
InnerProductParameter, which is defined as follows.
message InnerProductParameter {
optional uint32 num_output = 1;
optional bool bias_term = 2 [default = true];
optional FillerParameter weight_filler = 3;
optional FillerParameter bias_filler = 4;
}

The variable num_output determines the number of neurons in the layer. The
variable bias_term tells Caffe whether or not to consider the bias term in neuron
computations. Also, weight_filler and bias_filler are used to specify how to initialize
the parameters of the fully connected layer.
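As a small worked example using the layers of the network built in this chapter: the first fully connected layer (fc1, 120 neurons) receives the flattened pool2 blob of 16 × 5 × 5 = 400 values, so it holds 400 × 120 weights plus 120 biases. A sketch of this count:

inputs = 16 * 5 * 5        # pool2 produces 16 feature maps of size 5 x 5
num_output = 120           # neurons in the first fully connected layer
weights = inputs * num_output
biases = num_output        # present because bias_term defaults to true
print(weights + biases)    # prints 48120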

4.3.7 Dropout Layer
A dropout layer can be placed anywhere in a network, but it is most common to put it
immediately after an activation layer. In practice, it is mainly placed after the activations
of fully connected layers. The reason is that fully connected layers increase the
nonlinearity of a model and apply the final transformations on the features extracted by
the previous layers. The model may overfit because of these final transformations. For
this reason, we try to regularize the model by adding dropout layers after the fully
connected layers. A dropout layer is defined by setting type="Dropout". Then, the
hyperparameter of a dropout layer is determined using an instance of DropoutParameter
which is defined as follows:
message DropoutParameter {
optional float dropout_ratio = 1 [ default = 0.5];
}

As we can see, a dropout layer has only one hyperparameter, which is the dropout
ratio. Since this ratio is the probability of dropping a unit, it has to be a floating
point number between 0 and 1. The default value in Caffe is 0.5. We added two fully
connected layers to our network and placed a dropout layer after each of them.
The diagram of the network after applying these changes is illustrated in Fig. 4.8.

4.3.8 Classification and Loss Layers
The last layer in a classification network is a fully connected layer, where the number
of neurons in this layer is equal to the number of classes in the dataset. Training a
neural network is done by minimizing a loss function.


Fig. 4.8 Diagram of network after adding two fully connected layers and two dropout layers

In this book, we explained the hinge loss and logistic loss functions for multiclass
classification problems. These two loss functions accept at least two bottoms. The first
bottom is the output of the classification layer and the second bottom is the actual labels
produced by the ImageData layer. The loss layer computes the loss based on these two
bottoms and returns a scalar in its top.
The hinge loss function is created by setting type="HingeLoss" and the multiclass
logistic loss is created by setting type="SoftmaxWithLoss". Then, we mainly need
to enter the bottoms and the top of the loss layer. We added a classification layer and a
multiclass logistic loss to the protobuf. The final protobuf will be:
layer{
name: "data"
type : "ImageData"
top : "data"
top : "label"
image_data_param{
source : " /home/hamed/Desktop/GTSRB/Training_CNN/ train . txt "
batch_size:30
root_folder : " /home/hamed/Desktop/GTSRB/Training_CNN/ "
is_color : true
shuffle : true
new_width:32
new_height:32
}
}
layer{
name: "conv1"
type : "Convolution"
bottom: "data"
top : "conv1"
convolution_param{
num_output: 6
kernel_size :5
weight_filler{ type : "xavier" }
bias_filler{ type : "constant" value:0 }
}
}
layer{
name: "relu_c1"
type : "ReLU"
bottom: "conv1"
top : "relu_c1"
relu_param{ negative_slope:0.01 }
}
layer{
name: "pool1"
type : "Pooling"

bottom: "relu_c1"
top : "pool1"
pooling_param{ kernel_size :2 stride :2 }

}
layer{
name: "conv2"
type : "Convolution"
bottom: "pool1"
top : "conv2"
convolution_param{
num_output: 16
kernel_size :5
weight_filler{ type : "xavier" }
bias_filler{ type : "constant" value:0 }
}
}
layer{
name: "relu_c2"
type : "ReLU"
bottom: "conv2"
top : "relu_c2"
relu_param{ negative_slope:0.01 }
}
layer{
name: "pool2"
type : "Pooling"
bottom: "relu_c2"
top : "pool2"
pooling_param{ kernel_size :2 stride :2 }
}
layer{
name: "fc1"
type : "InnerProduct"
bottom: "pool2"
top : "fc1"
inner_product_param{
num_output:120
weight_filler{ type : "xavier" }
bias_filler{ type : "constant" value:0 }
}
}
layer{
name: "relu_fc1"
type : "ReLU"
bottom: "fc1"
top : "relu_fc1"
relu_param{ negative_slope:0.01 }
}
layer{
name: "drop1"
type : "Dropout"
bottom: "relu_fc1"
top : "drop1"
dropout_param{ dropout_ratio :0.4 }
}
layer{
name: "fc2"
type : "InnerProduct"
bottom: "drop1"
top : "fc2"
inner_product_param{
num_output:84
weight_filler{ type : "xavier" }
bias_filler{ type : "constant" value:0 }
}
}
layer{

name: "relu_fc2"
type : "ReLU"
bottom: "fc2"
top : "relu_fc2"
relu_param{ negative_slope:0.01 }
}
layer{
name: "drop2"
type : "Dropout"
bottom: "relu_fc2"
top : "drop2"
dropout_param{ dropout_ratio :0.4 }
}
layer{
name: " fc3_classification "
type : "InnerProduct"
bottom: "drop2"
top : " classifier "
inner_product_param{
num_output:43
weight_filler{type : "xavier"}
bias_filler{ type : "constant" value:0 }
}
}
layer{
name: "loss"
type : "SoftmaxWithLoss"
bottom: " classifier "
bottom: "label"
top : "loss"
}


Considering that there are 43 classes in the GTSRB dataset, the number of neurons
in the classification layer must also be equal to 43. The diagram of the final network is
illustrated in Fig. 4.9.
The above protobuf text is stored in a text file on disk. In this example, we store
it in "/home/pc/cnn.prototxt". The above definition reads the training
samples and feeds them to the network. However, in practice, the network must also be
evaluated using a validation set during training in order to assess how good the
network is.
To achieve this goal, the network can be evaluated every K iterations of the training
algorithm. As we will see shortly, this can be easily done by setting a parameter.
Assume K iterations have finished and Caffe wants to evaluate the network.
So far, we have only fetched data from the training set. Obviously, we have to tell
Caffe where to look for the validation samples. To this end, we add another ImageData
layer right after the first ImageData layer and specify the location of the validation
samples instead of the training samples. In other words, the first layer in the above
network definition will be replaced by:


Fig. 4.9 Final architecture of the network. The architecture is similar in nature to the architecture
of LeNet-5. The differences are in the activation functions, the dropout layers, and the connections
in the middle layers

layer{
name: "data"
type : "ImageData"
top : "data"
top : "label"
image_data_param{
source : " /home/hamed/Desktop/GTSRB/Training_CNN/ train . txt "
batch_size:30
root_folder : " /home/hamed/Desktop/GTSRB/Training_CNN/ "
is_color : true
shuffle : true
new_width:32
new_height:32
}
}
layer{
name: "data"
type : "ImageData"
top : "data"
top : "label"
image_data_param{
source : " /home/hamed/Desktop/GTSRB/Training_CNN/ validation . txt "
batch_size:10
root_folder : " /home/hamed/Desktop/GTSRB/Validation_CNN/ "
is_color : true
shuffle : false
new_width:32
new_height:32
}
}


First, the tops of these two layers have to be identical. This is due to the fact that the
first convolution layer is connected to a top called data. If we set the top of the second
ImageData layer to another name, the convolution layer will not receive any data
during validation. Second, the variable source in the second layer points to the
validation set. Third, the batch sizes of these two layers can be different. If memory
on the GPU device is limited, we usually set the batch size of the training layer to
an appropriate value and then set the batch size of the validation layer according to the
memory limitations. In this example, we set the batch size of the validation samples
to 10. Fourth, shuffle must be set to false in order to prevent unequal validation sets.
In fact, the parameters that we will explain in the next section are adjusted such that
the validation set is scanned exactly once in every test.
However, a user may forget to adjust this parameter properly, so that some samples
of the validation set are fetched more than once per test. In that case, if shuffle
is set to true, it is very likely that the samples used in two validation steps are not
identical. This makes the validation results inaccurate. We always want to test/validate
different models, or the same model at different times, on exactly identical datasets.
During training, the data has to come only from the first ImageData layer. During
validation, the data has to come only from the second ImageData layer. One missing
piece in the above definition is how Caffe should understand when to switch from
one ImageData layer to the other. There is a variable in the definition of LayerParameter
called include which is an instance of NetStateRule.
message NetStateRule {
optional Phase phase = 1;
}


When this variable is specified, Caffe will include the layer based on the current phase
of training. This is best explained with an example. Let us update the above two
ImageData layers as follows:
layer{
name: "data"
type: "ImageData"
top: "data"
top: "label"
include{
phase: TRAIN
}
image_data_param{
source: "/home/hamed/Desktop/GTSRB/Training_CNN/train.txt"
batch_size: 30
root_folder: "/home/hamed/Desktop/GTSRB/Training_CNN/"
is_color: true
shuffle: true
new_width: 32
new_height: 32
}
}
layer{
name: "data"
type: "ImageData"
top: "data"
top: "label"
include{
phase: TEST
}
image_data_param{
source: "/home/hamed/Desktop/GTSRB/Training_CNN/validation.txt"
batch_size: 10
root_folder: "/home/hamed/Desktop/GTSRB/Validation_CNN/"
is_color: true
shuffle: false
new_width: 32
new_height: 32
}
}

While training a network, Caffe alternately switches its state between TRAIN and
TEST based on a parameter called test_interval (this parameter will be explained in
the next section). In the TRAIN phase, the second ImageData layer will be discarded
by Caffe. In contrast, in the TEST phase the first layer will be discarded and the second
layer will be included. If the variable include is not instantiated in a layer, the
layer will be included in both phases. We apply the above changes to the text file
and save it.
Finally, we add a layer to our network in order to compute the accuracy of the
network on the test samples. This is simply done by adding the following definition
right after the loss layer.
layer{
name: "acc1"
type : "Accuracy"
bottom: " classifier "
bottom: "label"
top : "acc1"
include{ phase :TEST }
}


4.4 Training a Network
In order to train a neural network in Caffe, we have to design another text file and
instantiate a SolverParameter inside this file. All required rules for training a neural
network will be specified using SolverParameter.
message SolverParameter {
optional string net = 24;
optional float base_lr = 5;

repeated int32 test_iter = 3;
optional int32 test_interval = 4 [default = 0];
optional int32 display = 6;

optional int32 max_iter = 7;
optional int32 iter_size = 36 [default = 1];

optional string lr_policy = 8;
optional float gamma = 9;
optional float power = 10;
optional int32 stepsize = 13;

optional float momentum = 11;
optional float weight_decay = 12;
optional string regularization_type = 29 [default = "L2"];
optional float clip_gradients = 35 [default = -1];

optional int32 snapshot = 14 [default = 0];
optional string snapshot_prefix = 15;

enum SolverMode {
CPU = 0;
GPU = 1;
}
optional SolverMode solver_mode = 17 [default = GPU];
optional int32 device_id = 18 [default = 0];

optional string type = 40 [default = "SGD"];
}

The string variable net points to the .prototxt file that includes the definition of
the network. In our example, this variable is set to net="/home/pc/cnn.prototxt".
The variable base_lr denotes the base learning rate. The effective learning rate at
each iteration is defined based on the values of lr_policy, gamma, power, and stepsize.
Recall from Sect. 3.6.4 that the learning rate is usually decreased over time, and we
explained different methods for decreasing it. In Caffe, setting lr_policy="exp"
decreases the learning rate using the exponential rule. Likewise, setting this parameter
to "step" or "inv" decreases the learning rate using the step method or the inverse
method, respectively.
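For reference, the update rules of these three policies are documented in caffe.proto. The following short Python sketch (our own illustration, not part of the Caffe API) reproduces them so that the decay behavior can be inspected for given values of base_lr, gamma, power, and stepsize:

import math

def effective_lr(policy, itr, base_lr, gamma=0.98, power=0.75, stepsize=3000):
    # Learning-rate schedules as documented in caffe.proto (sketch for illustration).
    if policy == 'exp':      # exponential decay
        return base_lr * (gamma ** itr)
    elif policy == 'step':   # multiply by gamma every stepsize iterations
        return base_lr * (gamma ** math.floor(itr / float(stepsize)))
    elif policy == 'inv':    # inverse decay
        return base_lr * (1.0 + gamma * itr) ** (-power)
    raise ValueError('unknown policy')

for it in (0, 3000, 15000, 30000):
    print('%d -> %g' % (it, effective_lr('step', it, base_lr=0.01)))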
The parameter test_iter tells Caffe how many mini-batches it should use during the
test phase. The total number of samples used in the test phase is therefore equal
to test_iter × the batch size of the test ImageData layer. The variable test_iter is usually
set such that the test phase covers all samples of the validation set without using any
sample twice. Caffe changes its phase from TRAIN to TEST every test_interval
iterations (mini-batches). Then, it runs the TEST phase for test_iter mini-batches
and changes its phase back to TRAIN.
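For instance, if the validation set contained 5,000 images (a hypothetical number) and the batch size of the TEST ImageData layer were 10, test_iter should be set to 500 so that exactly one pass over the validation set is made in each test phase. A minimal sketch of this calculation:

import math

num_validation_samples = 5000   # hypothetical size of the validation set
test_batch_size = 10            # batch_size of the TEST ImageData layer
test_iter = int(math.ceil(num_validation_samples / float(test_batch_size)))
print(test_iter)                # prints 500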
While Caffe is training the network, it produces human-readable output. The
variable display tells Caffe to print this information to the console and write it into a
log file every display iterations. Also, the variable max_iter shows the maximum
number of iterations that must be performed by the optimization algorithm. On
Ubuntu, the log file is accessible in the directory /tmp.
Sometimes, because images are large or memory on the GPU device is limited, it is
not possible to set the mini-batch size of the training samples to an appropriate value.
On the other hand, if the mini-batch size is very small, gradient descent is likely to have
a very zigzag trajectory and, in some cases, it may even jump over a (local) minimum.
This makes the optimization algorithm more sensitive to the learning rate. Caffe
alleviates this problem by first accumulating the gradients of iter_size mini-batches
and then updating the parameters based on the accumulated gradients. In other words,
the effective mini-batch size becomes batch_size × iter_size (for example, a batch size
of 30 with iter_size equal to 4 behaves like a mini-batch of 120 samples). This makes
it possible to train large networks when memory on the GPU device is not sufficient.
As it turns out, the variable momentum determines the value of the momentum in
momentum gradient descent. It is usually set to 0.9. The variable weight_decay
shows the value of λ in the L1 and L2 regularizations. The type of regularization
is defined using the string variable regularization_type.


This variable can only be set to "L1" or "L2". The variable clip_gradients defines the
threshold used by the max-norm regularization method (Sect. 3.6.3.3).
Caffe stores the weights and the state of the optimization algorithm at the location given
by snapshot_prefix every snapshot iterations. Using these files, you can load the parameters
of the network after training or resume training from a specific iteration.
The optimization algorithm can be executed on a CPU or a GPU. This is specified
using the variable solver_mode. In case you have more than one graphics card,
the variable device_id tells Caffe which card must be used for the computations.
Finally, the string variable type determines the type of optimization algorithm.
In the rest of this book, we will always use "SGD", which refers to mini-batch
gradient descent. Other optimization algorithms such as Adam, AdaGrad, Nesterov,
RMSProp, and AdaDelta are also implemented in the Caffe library. For our example,
we write the following protobuf text in a file called solver.prototxt.
net: "/home/pc/cnn.prototxt"
type: "SGD"

base_lr: 0.01

test_iter: 50
test_interval: 500
display: 50

max_iter: 30000

lr_policy: "step"
stepsize: 3000
gamma: 0.98

momentum: 0.9
weight_decay: 0.00001

snapshot: 1000
snapshot_prefix: "cnn"

After creating the text files for the network architecture and for the optimization
algorithm, we can use the command-line tools of the Caffe library to train and evaluate
the network. Specifically, running the following command in a Terminal of Ubuntu
will train the network:
./caffe-master/build/tools/caffe train --solver "/PATH_TO_SOLVER/solver.prototxt"

4.5 Designing in Python
Assume we have 100 GPUs and we can train a big neural network on each of
them separately. With these resources available, our aim is to generate 1000 different
architectures and train/validate each of them on one of these GPUs. Obviously, it
is not tractable for a human to create 1000 different architectures in text files. The
situation gets even more impractical if our aim is to generate 1000 significantly
different architectures.


The more efficient solution is to generate these files using a computer program. The
program may use heuristics to create different architectures or it may generate them
randomly. Regardless, the program must generate text files containing the definitions
of the networks.
The Caffe library provides a Python interface that makes it possible to use
Caffe functions in a Python program. The Python interface is located at
caffe-master/python. If this path is not specified in the PYTHONPATH environment
variable, importing the Python module of Caffe will cause an error. To solve this
problem, you can either set the environment variable or write the following code before
importing the module:
import sys
sys.path.insert(0, '/home/pc/caffe-master/python')
import caffe

In the above script, we have assumed that the Caffe library is located at
"/home/pc/caffe-master/". If you open __init__.py from caffe-master/python/caffe/
you will find the names of the functions, classes, objects, and attributes that you can
use in your Python script. Alternatively, you can run the following code to obtain the
same information:
import sys
sys.path.insert(0, '/home/pc/caffe-master/python')
import caffe

print dir(caffe)

In order to design a network, we should work with two attributes called layers and
params and a class called NetSpec. The following Python script creates a ConvNet
identical to the network we created in the previous section.
import sys
sys.path.insert(0, '/home/hamed/caffe-master/python')
import caffe

L = caffe.layers
P = caffe.params

def conv_relu(bottom, ks, nout, stride=1, pad=0):
    c = L.Convolution(bottom, kernel_size=ks, num_output=nout,
                      stride=stride, pad=pad,
                      weight_filler={'type': 'xavier'},
                      bias_filler={'type': 'constant',
                                   'value': 0})
    r = L.ReLU(c)
    return c, r

def fc_relu_drop(bottom, nout):
    fc = L.InnerProduct(bottom, num_output=nout,
                        weight_filler={'type': 'xavier'},
                        bias_filler={'type': 'constant',
                                     'value': 0})
    r = L.ReLU(fc)
    d = L.Dropout(r, dropout_ratio=0.4)
    return fc, r, d

net = caffe.net_spec.NetSpec()

net.data, net.label = L.ImageData(source='/home/hamed/Desktop/train.txt',
                                  batch_size=30, is_color=True,
                                  shuffle=True, new_width=32,
                                  new_height=32, ntop=2)

net.conv1, net.relu1 = conv_relu(net.data, 5, 6)
net.pool1 = L.Pooling(net.relu1, kernel_size=2,
                      stride=2, pool=P.Pooling.MAX)

net.conv2, net.relu2 = conv_relu(net.pool1, 5, 16)
net.pool2 = L.Pooling(net.relu2, kernel_size=2,
                      stride=2, pool=P.Pooling.MAX)

net.fc1, net.fc_relu1, net.drop1 = fc_relu_drop(net.pool2, 120)
net.fc2, net.fc_relu2, net.drop2 = fc_relu_drop(net.drop1, 84)
net.classifier = L.InnerProduct(net.drop2, num_output=43,
                                weight_filler={'type': 'xavier'},
                                bias_filler={'type': 'constant',
                                             'value': 0})
net.loss = L.SoftmaxWithLoss(net.classifier, net.label)

with open('cnn.prototxt', 'w') as fs:
    fs.write(str(net.to_proto()))
    fs.flush()

In general, creating a layer can be done using the following template:
net.top1, net.top2, ..., net.topN = L.LAYERTYPE(bottom1, bottom2, ..., bottomM,
                                                kwarg1=value, kwarg2=value,
                                                kwarg3=dict(kwarg=value, ...), ..., ntop=N)

The number of tops in a layer is determined using the argument ntop. Using this
method, the function will generate ntop top(s) in the output. Hence, there have to be
N variables on the left side of the assignment operator. The names of the tops in the
text file will be "top1", "top2" and so on. That said, if the first top of the function is
assigned to net.label, it is analogous to putting top="label" in the text file.
Also, note that the assignments have to be done on net.*. If you study the source
code of NetSpec, you will find that the __setattr__ method of this class is designed in
a special way such that executing:
net.DUMMY_NAME = value

will actually create an entry in a dictionary with the key DUMMY_NAME.
The next point is that calling L.LAYERTYPE will create a layer in the text
file whose type will be equal to type="LAYERTYPE". Therefore, if we
want to create a convolution layer, we have to call L.Convolution. Likewise, creating
pooling, loss, and ReLU layers is done by calling L.Pooling, L.SoftmaxWithLoss,
and L.ReLU, respectively.
Any positional argument that is passed to the L.LAYERTYPE function will be
considered a bottom of the layer. Also, any keyword argument will be treated as a
parameter of the layer. In the case that a layer has a parameter such as weight_filler
with a data type other than the basic types, the inner fields of this parameter can be
defined using a dictionary in Python.
After the architecture of the network is defined, it can simply be converted to a
string by calling str(net.to_proto()). Then, this text can be written into a text file and
stored on disk.
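To make the template concrete, here is a minimal, self-contained sketch (a toy network with placeholder paths, not the network of this chapter) that creates an ImageData layer with two tops and a pooling layer on top of it, and then prints the generated prototxt text:

import sys
sys.path.insert(0, '/home/pc/caffe-master/python')   # adjust to your Caffe path
import caffe

L = caffe.layers
P = caffe.params

net = caffe.NetSpec()
# ntop=2 creates two tops; assigning them to net.data and net.label
# names the tops "data" and "label" in the generated text file.
net.data, net.label = L.ImageData(source='train.txt', batch_size=8, ntop=2)
net.pool1 = L.Pooling(net.data, kernel_size=2, stride=2, pool=P.Pooling.MAX)

print(str(net.to_proto()))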


4.6 Drawing Architecture of Network
The Python interface provides a function for generating a graph for a given network
definition text file. This can be done by calling the following function:
import sys
sys.path.insert(0, '/home/hamed/caffe-master/python')
import caffe
import caffe.draw
from caffe.proto import caffe_pb2
from google.protobuf import text_format

def drawcaffe(def_file, save_to, direction='TB'):
    net = caffe_pb2.NetParameter()
    text_format.Merge(open(def_file).read(), net)

    caffe.draw.draw_net_to_file(net, save_to, direction)

This function uses the GraphViz Python module to generate the diagram. The parameter
direction determines the orientation of the graph and can be set to 'TB' (top-bottom),
'BT' (bottom-top), 'LR' (left-right), or 'RL' (right-left). The diagrams shown in this
chapter were created by calling this function.
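For instance, the network designed in this chapter can be rendered to an image file with a call along these lines (the paths are placeholders; caffe.draw additionally requires the pydot package to be installed):

# drawcaffe is the helper function defined above; the paths are hypothetical.
drawcaffe('/tmp/cnn.prototxt', '/tmp/cnn_diagram.png', direction='LR')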

4.7 Training Using Python
After creating the solver.prototxt file, we can use it for training the network by writing
a Python script rather than using the command-line tools. The Python script for training
a network might look like:
caffe.set_mode_gpu()
solver = caffe.get_solver('/tmp/solver.prototxt')
solver.step(25)

The first line in this code tells Caffe to use the GPU instead of the CPU. If this command
is not executed, Caffe will use the CPU by default. The second line in this code loads the
solver definition. Because the path of the network is also mentioned inside the solver
definition, the network is automatically loaded as well. Then, calling the step(25)
function runs the optimization algorithm for 25 iterations and stops. Assume that
test_interval=100 and we call solver.step(150). If the network is trained using the
command-line tools, Caffe will switch from TRAIN to TEST immediately after the
100th iteration. This will also happen when solver.step(150) is called. Hence, if you do
not want the test phase to be invoked automatically by Caffe, the variable test_interval
must be set to a large number (larger than the variable max_iter).
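As a slightly longer sketch (our own illustration; the solver path is the one assumed above), the network can be trained in chunks while monitoring the training loss of the most recent mini-batch through the loss blob:

caffe.set_mode_gpu()
solver = caffe.get_solver('/tmp/solver.prototxt')

for chunk in range(10):                      # 10 x 100 = 1000 iterations in total
    solver.step(100)                         # run the optimizer for 100 mini-batches
    loss = float(solver.net.blobs['loss'].data)
    print('iteration %d, training loss %.4f' % ((chunk + 1) * 100, loss))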


4.8 Evaluating Using Python
Any neural network must be evaluated in three stages. The first evaluation is done
during training using the training set. The second evaluation is done during training
using the validation set, and the third evaluation is done using the test set after designing
and training the network is completely finished.
Recall from Sect. 3.5.3 that a network is usually evaluated using a classification
metric. All the classification metrics that we explained in that section are based on the
actual labels and the predicted labels of the samples. The actual labels are already
available in the dataset. However, the predicted labels are obtained using the network.
That means that in order to evaluate a network using one of the classification metrics, it
is necessary to predict the labels of the samples. These samples may come from the
training set, the validation set, or the test set.
In the case of a neural network, we have to feed the samples to the network and
forward them through the network. The output layer shows the score of each sample
for each class. For example, the output (top) of the classification layer of the network
in Sect. 4.5 is called classifier. We can access the value that the network computed
for a sample using the following command:
solver = caffe.get_solver('/tmp/solver.prototxt')
net = solver.net
print net.blobs['classifier'].data

In the above script, the first line loads a solver along with the network. The field
solver.net returns the network that is used for training. In Caffe, a tensor that retains
data is encapsulated in an object of type Blob. The field net.blobs is a dictionary
whose keys are the tops of the network that we have specified in the network
definition, and the value of each entry in this dictionary is an instance of Blob.
For example, the top of the classification layer in Sect. 4.5 is called "classifier". The
command net.blobs['classifier'] returns the blob associated with this layer.
The tensor of a blob is accessible through the field data. Hence, net.blobs['KEY'].data
returns the numerical data in a 4D matrix (tensor). This matrix is in fact a
Numpy array. The shape of tensors in Caffe is N × C × H × W, where N denotes
the number of samples in the mini-batch and C denotes the number of channels. As it
turns out, H and W denote the height and width, respectively.
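A quick way to inspect these shapes is to loop over net.blobs and print the shape of every top in the network; a small sketch using only the interfaces described above:

solver = caffe.get_solver('/tmp/solver.prototxt')
net = solver.net
for top_name, blob in net.blobs.items():
    # blob.data is a Numpy array; convolutional tops have shape N x C x H x W.
    print('%s %s' % (top_name, blob.data.shape))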
The batch size of the layer "data" in Sect. 4.5 is equal to 30. Also, this layer loads
color images (3 channels) of size 32 × 32. Therefore, the command net.blobs['data'].data
returns a 4D matrix of shape 30 × 3 × 32 × 32. Taking into account the
fact that the layer "classifier" in this network contains 43 neurons, the command
net.blobs['classifier'].data will return a matrix of size 30 × 43 × 1 × 1, where each
row of this matrix shows the class-specific scores of the corresponding sample in the
mini-batch. Each sample belongs to the class with the highest score.
Assume we want to classify a single image which is stored at /home/sample.ppm.
This means that the size of the mini-batch is equal to 1. To this end, we have to load
the image in RGB format and resize it to 32 × 32 pixels. Then, we transpose the axes
such that the shape of the image becomes 3 × 32 × 32. Finally, this matrix has to be
converted to a 1 × 3 × 32 × 32 matrix in order to make it compatible with tensors
in Caffe. This can be easily done using the following commands:
import numpy as np
im = caffe.io.load_image('/home/sample.ppm', color=True)
im = caffe.io.resize(im, (32, 32))
im = np.transpose(im, [2, 0, 1])
im = im[np.newaxis, ...]

Next, this image has to be fed into the network and the outputs of the layers must be
computed one by one. Technically, this is called forwarding the samples through
the network. Assuming that net is an instance of caffe.Net, forwarding the above
sample can be easily done by calling:
net.blobs['data'].data[...] = im[...]
net.forward()

It should be noted that the [...] in the above code copies the image into the memory of
the field data. Removing it from the above line will raise an error, since it would mean
that we are assigning new memory to the field data rather than updating its existing
memory. At this point, net.blobs[top].data returns the output of a top in the network.
In order to classify the above image with our network, we only need to run the
following line:
label = np.argmax(net.blobs['classifier'].data, axis=1)

This will return the index of the class with the maximum score. The general procedure
for training a ConvNet is illustrated below.
Givens:
X_train: a dataset containing N images of size WxHx3
Y_train: a vector of length N containing the label of each sample in X_train

X_valid: a dataset containing K images of size WxHx3
Y_valid: a vector of length K containing the label of each sample in X_valid

FOR t = 1 TO MAX
    TRAIN THE CONVNET FOR m ITERATIONS USING X_train AND Y_train
    EVALUATE THE CONVNET USING X_valid AND Y_valid
END FOR

The training procedure involves repeatedly updating the parameters using the training set
and evaluating the network using the validation set. More specifically, the network
is trained for m iterations using the training samples. Then, the validation samples are
fed into the network and a classification metric such as accuracy is computed for
the samples in the validation set. The above procedure is repeated MAX times and
then the training is finished. One may wonder why the network must be evaluated
during training. As we will see in the next chapter, validation is a crucial step in
training a classification model such as a neural network. The following code shows
how to implement the above procedure in Python:
solver = caffe.get_solver('solver.prototxt')

with open('validation.txt', 'r') as file_id:
    valid_set = csv.reader(file_id, delimiter=' ')
    valid_set = [(row[0], int(row[1])) for row in valid_set]

net_valid = solver.test_nets[0]
data_val = np.zeros(net_valid.blobs['data'].data.shape, dtype='float32')
label_actual = np.zeros(net_valid.blobs['label'].data.shape, dtype='int8')
for i in xrange(500):
    solver.step(1000)

    print 'Validating...'
    acc_valid = []
    net_valid.share_with(solver.net)

    batch_size = net_valid.blobs['data'].data.shape[0]
    cur_ind = 0

    for _ in xrange(800):
        for j in xrange(batch_size):
            rec = valid_set[cur_ind]
            im = cv2.imread(rec[0], cv2.cv.CV_LOAD_IMAGE_COLOR).astype('float32')
            im = im / 255.
            im = cv2.resize(im, (32, 32))
            im = np.transpose(im, [2, 0, 1])

            data_val[j, ...] = im
            label_actual[j, ...] = rec[1]
            cur_ind = cur_ind + 1 if ((cur_ind + 1) < len(valid_set)) else 0

        net_valid.blobs['data'].data[...] = data_val
        net_valid.blobs['label'].data[...] = label_actual
        net_valid.forward()

        class_score = net_valid.blobs['classifier'].data.copy()
        label_pred = np.argmax(class_score, axis=1)
        acc = sum(label_actual.ravel() == label_pred) / float(label_pred.size)
        acc_valid.append(acc)
    mean_acc = np.asarray(acc_valid).mean()
    print 'Validation accuracy: {}'.format(mean_acc)

The first line loads the solver together with the train and test networks associated with
this solver (the listing assumes that the csv, cv2, numpy, and caffe modules have
already been imported as in the previous scripts). Lines 3 to 5 read the validation
dataset into a list. Lines 8 and 9 create containers for the validation samples and their
labels. The training loop starts at Line 10 and it will be repeated 500 times. The first
statement in this loop (Line 11) trains the network using the training samples for
1000 iterations.
After that, validation of the network starts at Line 13. The idea is to load 800
mini-batches of validation samples, where each mini-batch contains batch_size samples.
The loop from Line 21 to Line 30 loads color images and resizes them using OpenCV
functions. It also rescales the pixel intensities to [0, 1]. Rescaling is necessary since
the training samples are also rescaled by setting scale:0.0039215 in the definition of
the ImageData layer.6
The loaded images are transposed and copied into the data_val tensor. The label of each
sample is also copied into the label_actual tensor. After filling the mini-batch, it is copied
into the first layer of the network in Lines 32 and 33. Then, it is forwarded
through the network at Line 34.

6 It is possible to load and scale images using functions in the caffe.io module. However, it should be
noted that the imread function from OpenCV loads color images in BGR order rather than RGB.
This is similar to the way the ImageData layer loads images using OpenCV. In the case of using the
caffe.io.load_image function, we must swap the R and B channels before feeding the images to the network.


Lines 36 and 37 find the class of each sample; the classification accuracy is computed on
the mini-batch and stored in a list. Finally, the mean accuracy over the 800 mini-batches
is computed and stored in mean_acc. The above code can be used as a basic template
for training and validating neural networks in Python using the Caffe library. It is also
possible to keep the history of the training and validation accuracies in the above code.
However, there are a few points to bear in mind. First, the same transformations
must be applied on the validation/test samples as we have used for the training samples.
Second, the validation samples must be identical every time the network is evaluated.
Otherwise, it might not be possible to assess the network properly. Third, as we
discussed earlier, the F1-score can be computed over all validation samples instead of the
accuracy.
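For example, if the scikit-learn package is available (an assumption; it is not part of Caffe), the actual and predicted labels of all validation mini-batches can be accumulated in two lists and passed to sklearn.metrics.f1_score once at the end. A minimal sketch:

import numpy as np
from sklearn.metrics import f1_score        # assumes scikit-learn is installed

def macro_f1(actual_batches, predicted_batches):
    # Each argument is a list of 1-D label arrays, one entry per mini-batch
    # (e.g., label_actual.ravel() and np.argmax(class_score, axis=1) above).
    actual = np.concatenate(actual_batches)
    predicted = np.concatenate(predicted_batches)
    return f1_score(actual, predicted, average='macro')

# Toy usage with made-up labels:
print(macro_f1([np.array([0, 1, 1])], [np.array([0, 1, 0])]))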

4.9 Save and Restore Networks
During training, we might want to save and restore the parameters of the network. In
particular, we will need the values of the trained parameters in order to load them into
the network and use the network in real-world applications. This can be done by writing
a customized function that reads the values of the net.params dictionary and saves them
in a file. Later, we can load the same values back into the net.params dictionary.
Another way is to use the built-in functions of the Caffe library. Specifically, the
net.save(string filename) and net.copy_from(string filename) functions save the
parameters into a binary file and load them into the network, respectively.
In some cases, we may also want to save information related to the optimizer,
such as the current iteration, current learning rate, current momentum, etc., besides the
parameters of the network. Later, this information can be loaded into the optimizer as
well as the network in order to resume training from the point where it last stopped.
Caffe provides the solver.snapshot() and solver.restore(string filename) functions for
these purposes.
Assume the field snapshot_prefix is set to "/tmp/cnn" in the solver definition file.
Calling solver.snapshot() will create two files as follows:
/tmp/cnn_iter_X.caffemodel
/tmp/cnn_iter_X.solverstate

where X is automatically replaced by Caffe with the current iteration of the optimization
algorithm. In order to restore the state of the optimization algorithm from disk,
we only need to call solver.restore(filename) with a path to a valid .solverstate
file.
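A minimal sketch of both mechanisms (the file names below follow the snapshot_prefix assumption above, and the iteration number 1000 is only an example):

solver = caffe.get_solver('/tmp/solver.prototxt')

# Save the current weights of the training network to a binary file.
solver.net.save('/tmp/cnn_manual.caffemodel')

# Load previously saved weights into the network (e.g., to continue fine-tuning).
solver.net.copy_from('/tmp/cnn_iter_1000.caffemodel')

# Restore both the network and the optimizer state from a snapshot and resume.
solver.restore('/tmp/cnn_iter_1000.solverstate')
solver.step(100)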


4.10 Python Layer in Caffe
One limitation of the Caffe library is that we are obliged to use only the layers
implemented in this library. For example, the softplus activation function is not
implemented in the current version of the Caffe library. In some cases, we may want
to add a layer with a new function that is not implemented in the Caffe library. The
obvious solution is to implement this layer directly in C++ by inheriting our classes
from the classes of the Caffe library. This could be a tedious task, especially when the
goal is to quickly implement and test an idea.
A more likely scenario in which having a special layer could be advantageous is
when we work with different datasets. For instance, there are thousands of samples
in the GTSRB dataset for the task of traffic sign classification. The bounding box
information of each image is provided using a text file. Apparently, these images
have to be cropped to exactly fit the bounding box before being fed to a classification
network.
This can be done in three ways. The first way is to process the whole dataset, crop
each image based on its bounding box information, and store the results on disk.
Then, the processed dataset can be used for training/validation/testing the network.
The second solution is to process the images on the fly and fill each mini-batch
after processing the images. Then, these mini-batches can be used for training/validation/testing.
However, it should be noted that, using this method, we will no longer
be able to call the solver.step(int) function with an argument greater than one or
set iter_size to a value greater than one. The reason is that each mini-batch must be
filled manually by our code. The third method is to develop a new layer which
automatically reads images from the dataset, processes them, and passes them to the
output (top) of the layer. Using this method, the solver.step(int) function can be called
with any arbitrary positive number.
The Caffe library provides a special type of layer called PythonLayer. Using this
layer, we are able to develop new layers in Python which can be accessed by Caffe.
A Python layer is configured using an instance of PythonParameter which is defined
as follows:
message PythonParameter {
optional string module = 1;
optional string layer = 2;
optional string param_str = 3 [default = ''];
}


Based on this definition, a Python layer might look like:
layer {
name: "data"
type: "Python"
top: "data"
python_param {
module: "python_layer"
layer: "mypythonlayer"
param_str: "{\'param1\': 1, \'param2\': 2.5}"
}
}



The variable type of a Python layer must be set to Python. Upon reaching this
layer, Caffe will look for the file python_layer.py next to the .prototxt file. Then, it will
look for a class called mypythonlayer inside this file. Finally, it will pass "{'param1': 1,
'param2': 2.5}" to this class. Caffe will interact with mypythonlayer using four methods
inside this class. Below is the template that must be followed in designing a new
layer in Python.
class mypythonlayer(caffe.Layer):
    def setup(self, bottom, top):
        pass

    def reshape(self, bottom, top):
        pass

    def forward(self, bottom, top):
        pass

    def backward(self, top, propagate_down, bottom):
        pass

First, the class must be inherited from caffe.Layer. The setup method will be
called only once when Caffe creates the train and test networks. The backward method
is only called during the backpropagation step. Computing the output of each layer
given an input is done by calling the net.forward() method. Whenever this method is
called, the reshape and forward methods of the layer will be called automatically.
The reshape method is always called before the forward method.
It is worth drawing your attention to the prototype of the backward method.
In contrast to the other three methods, where the first argument is bottom and the
last argument is top, in the backward method the places of these two arguments are
switched. So, great care must be taken in defining the prototype of
the method. Otherwise, you may end up with a layer where the gradients are not
computed correctly. For instance, let us implement the PReLU activation using a
Python layer. In this implementation, we consider a distinct PReLU activation for
each feature map.
class prelu(caffe.Layer):
    def setup(self, bottom, top):
        params = eval(self.param_str)
        shape = [1] * len(bottom[0].data.shape)
        shape[1] = bottom[0].data.shape[1]
        self.axis = range(len(shape))
        del self.axis[1]
        self.axis = tuple(self.axis)

        self.blobs.add_blob(*shape)
        self.blobs[0].data[...] = params['alpha']

    def reshape(self, bottom, top):
        top[0].reshape(*bottom[0].data.shape)

    def forward(self, bottom, top):
        top[0].data[...] = np.where(bottom[0].data > 0,
                                    bottom[0].data,
                                    self.blobs[0].data * bottom[0].data)

    def backward(self, top, propagate_down, bottom):
        self.blobs[0].diff[...] = np.sum(np.where(bottom[0].data > 0,
                                                  np.zeros(bottom[0].data.shape),
                                                  bottom[0].data) * top[0].diff,
                                         axis=self.axis, keepdims=True)
        bottom[0].diff[...] = np.where(bottom[0].data > 0,
                                       np.ones(bottom[0].data.shape),
                                       self.blobs[0].data) * top[0].diff

The setup method converts the param_str value specified in the network definition
into a dictionary. Then, the shape of the parameter vector is determined. Specifically, if
the shape of the bottom layer is N × C × H × W, the shape of the parameter vector must
be 1 × C × 1 × 1. The dimensions of length 1 will be broadcast by Numpy during the
operations. Since there are C feature maps in the bottom layer, there must
also be C PReLU activations with different values of α.
In the case of fully connected layers, the bottom layer might be a two-dimensional
array instead of a four-dimensional array. The shape variable in this method ensures
that the parameter vector will have a shape consistent with the bottom layer.
The variable axis indicates the axes over which the summation of the gradient must be
performed. Again, these axes must also be consistent with the shape of the bottom layer.
Line 10 creates a parameter array whose shape is determined by the variable shape.
Note the unpacking operator in this line. Line 11 initializes the α of all PReLU
activations with a constant number. The setup method is called once and it initializes
all parameters of the layer.
The reshape method determines the shape of the top layer in Line 14. In the forward
method, the channel-wise PReLU activations are applied on the bottom layer and assigned
to the top layer. Note how we have utilized broadcasting of Numpy arrays in order to
multiply the parameters with the bottom layer. Finally, the backward method computes
the gradient with respect to the parameters and the gradient with respect to the bottom layer.
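To use this layer inside a network, the definition must contain a layer of type Python whose python_param points to the module and class above, as in the generic example earlier in this section. If the network is generated with the NetSpec interface of Sect. 4.5, a sketch along these lines should produce an equivalent definition; it assumes the prelu class is saved in a file called python_layer.py (a hypothetical name) that is visible to Caffe, and it only generates the prototxt text:

import sys
sys.path.insert(0, '/home/pc/caffe-master/python')   # adjust to your Caffe path
import caffe

L = caffe.layers

net = caffe.NetSpec()
net.data, net.label = L.ImageData(source='train.txt', batch_size=8, ntop=2)
# Attach the prelu Python layer defined above to the data top (illustrative only).
net.prelu1 = L.Python(net.data,
                      python_param=dict(module='python_layer',
                                        layer='prelu',
                                        param_str="{'alpha': 0.25}"))
print(str(net.to_proto()))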

4.11 Summary
There are various powerful libraries such as Theano, Lasagne, Keras, mxnet, Torch,
and TensorFlow that can be used for designing and training neural networks including
convolutional neural networks. Among them, Caffe is a library that can be used
for both doing research and developing real-world applications. In this chapter, we
explained how to design and train neural networks using the Caffe library. Moreover,
the Python interface of Caffe was discussed using real examples. Then, we mentioned
how to develop new layers in Python and use them in neural networks.

4.12 Exercises
4.1 Suppose the following text files:
/sample1.jpg 0
/sample2.jpg 0
/sample3.jpg 0
/sample4.jpg 0
/sample5.jpg 1
/sample6.jpg 1
/sample7.jpg 1
/sample8.jpg 1

/sample7.jpg 1
/sample1.jpg 0
/sample3.jpg 0
/sample6.jpg 1
/sample4.jpg 0
/sample5.jpg 1
/sample2.jpg 0
/sample8.jpg 1
From an optimization algorithm perspective, which one of the above files is appropriate
for passing to an ImageData layer? Also, which of these files has to be shuffled
before starting the optimization? Why?
4.2 The shifted ReLU activation is given by Clevert et al. (2015):

f(x) =  x − 1   if x > 0
        −1      otherwise                                        (4.1)

This activation function is not implemented in Caffe. However, you can
implement it using the existing layers of this library. Use a ReLU layer together with
a Bias layer to implement this activation function in Caffe. A Bias layer basically adds
a constant to the bottom blobs. You can find more information about this layer in
caffe.proto.
4.3 Why and when must shuffle of an ImageData layer in the TEST phase be set to
false?
4.4 When does setting shuffle to true or false in the TEST phase not matter?
4.5 What happens if we add include to the first convolution layer in the network we
mentioned in this chapter and set phase=TEST for this layer?
4.6 Add code to the Python script in order to keep the history of the training and
validation accuracies and plot them using Python.
4.7 How can we check the gradient of the implemented PReLU layer using numerical
methods?
4.8 Implement the softplus activation function using a Python layer.


Reference
Clevert DA, Unterthiner T, Hochreiter S (2015) Fast and accurate deep network learning by exponential linear units (ELUs). arXiv:1511.07289

5 Classification of Traffic Signs

5.1 Introduction
The car industry has progressed significantly in the last two decades. Today's cars are
not only faster, more efficient, and more attractive, they are also safer and smarter.
Improvements in the safety of cars are mainly due to advances in hardware and software.
From the software perspective, cars are becoming smarter by utilizing artificial
intelligence. The component of a car which is chiefly responsible for making intelligent
decisions is called the Advanced Driver Assistant System (ADAS).
In fact, ADASs are an indispensable part of smart cars and driverless cars. Tasks such
as adaptive cruise control, forward collision warning, and adaptive light control are
performed by an ADAS.1 These tasks usually obtain information from sensors other
than a camera.
There are also tasks which may work directly with images. Driver fatigue (drowsiness)
detection, pedestrian detection, blind spot monitoring, and drivable lane detection are
some of the tasks that chiefly depend on images obtained by cameras. There is one
task in ADASs which is the main focus of the next two chapters of this book: recognizing
vertical traffic signs.
A driverless car might not be considered smart if it is not able to automatically
recognize traffic signs. In fact, traffic signs help a driver (human or autonomous) to
conform with road rules and drive the car safely. In the near future, when driverless
cars are common, a road might be shared by human drivers as well as driverless
cars. Consequently, it is rational to expect that driverless cars perform at least as
well as a human driver.
Humans are good at understanding the scene and recognizing traffic signs using their
vision. There are two major goals in designing traffic signs. First, they must be easily
distinguishable from the rest of the objects in the scene and, second, their meaning must
be easily perceivable and independent of spoken language. To this end, traffic signs
are designed with simple geometrical shapes such as triangles, circles, rectangles, or
polygons. To be easily detectable from the rest of the objects, traffic signs are painted
using basic colors such as red, blue, yellow, black, and white. Finally, the meaning
of traffic signs is mainly conveyed by the pictographs in their centers. It
should be noted that some signs heavily depend on text-based information. However,
we can still think of the text on traffic signs as a pictograph.

1 An ADAS may perform many other intelligent tasks. The list here gives just a few examples.
Even though classification of traffic signs is an easy task for a human, there
are some challenges in developing an algorithm for this purpose. These challenges are
illustrated in Fig. 5.1. First, the image of a traffic sign might be captured from different
perspectives. Second, weather conditions may dramatically affect the appearance of
traffic signs. An example is illustrated in the figure, where the "no stopping" sign
is covered by snow. Third, traffic signs deteriorate over time. Because
of this, the color of a traffic sign is affected and some artifacts may appear on the sign,
which might have a negative impact on the classification score. Fourth, traffic signs
might be partially occluded by other signs or objects. Fifth, the pictograph area
might be manipulated by humans, which in some cases might change the shape of
the pictograph. The last issue shown in this figure is the difference in the pictographs of

Fig. 5.1 Some of the challenges in classification of traffic signs. The signs have been collected in
Germany and Belgium


the same traffic sign from one country to another. More specifically, we observe that
the "danger: bicycle crossing" sign possesses a few important differences between the two
countries.
Besides the aforementioned challenges, motion blur caused by sudden camera
movements, shadows on traffic signs, illumination changes, weather conditions, and
daylight variations are other challenges in classifying traffic signs.
As we mentioned earlier, traffic sign classification is one of the tasks of an ADAS.
Consequently, the classification must be done in real time and it must consume as
few CPU cycles as possible in order to release the CPU immediately. Last but not
least, the classification model must be easily scalable so that it can be adapted to
new classes in the future with little effort. In sum, any model for classifying traffic
signs must be accurate, fast, scalable, and fault-tolerant.
Traffic sign classification is a specific case of object classification where the objects
are more rigid and two dimensional. Recently, ConvNets surpassed humans on the
classification of 1000 natural objects (He et al. 2015). Moreover, there are other ConvNets
such as Simonyan and Zisserman (2015) and Szegedy et al. (2014a) with performances
close to He et al. (2015). However, the architectures of these ConvNets are significantly
different from each other. This suggests that the same problem might
be solved using ConvNets with different architectures and various complexities. Part
of the complexity of ConvNets is determined by the activation functions. They play
an important role in neural networks since they apply nonlinearities to the outputs
of the neurons, which enables ConvNets to apply a series of nonlinear functions to the
input and transform it into a space where the classes are linearly separable. As
we discuss in the next sections, selecting a computationally demanding activation function
can increase the number of arithmetic operations required by the network, which in
turn increases the response time of a ConvNet.
In this chapter, we will first study the methods for recognizing traffic signs and
then we will explain different network architectures for the task of traffic sign
classification. Moreover, we will show how to implement and train these networks
on a challenging dataset.

5.2 Related Work
In general, efforts for classifying traffic signs can be divided into traditional classification
approaches and convolutional neural networks. In the former approach, researchers have
tried to design hand-crafted features and train a classifier on top of these features.
In contrast, convolutional neural networks learn the representation and the classification
automatically from data. In this section, we first review the traditional classification
approaches and then we explain the previously proposed ConvNets for classifying
traffic signs.


5.2.1 Template Matching
Early works considered a traffic sign as a rigid object and classified the query image
by comparing it with all the templates stored in the database (Piccioli et al. 1996). Later,
Gao et al. (2006) matched shape features instead of pixel intensity values. In this
work, matching features is done using the Euclidean distance function. The problem with
this matching function is that it considers every pixel/feature equally important. To
cope with this problem, Ruta et al. (2010) learned a similarity measure for matching
the query sign with the templates.

5.2.2 Hand-Crafted Features
More accurate and robust results were obtained by learning a classification model
over a feature vector. Paclík et al. (2000) produce a binary image depending on
the color of the traffic sign. Then, moment invariant features are extracted from
the binary image and fed into a one-versus-all Laplacian kernel classifier. One
problem with this method is that the query image must be binarized before it is fed
into the classifier. Maldonado-Bascon et al. (2007) addressed this problem by
transforming the image into the HSI color space and calculating the histogram of the Hue and
Saturation components. Then, the histogram is classified using a multiclass SVM. In
another method, Maldonado Bascón et al. (2010) classified traffic signs using only
the pictograph of each sign. Although the pictograph is a binary image, accurate
segmentation of a pictograph is not a trivial task, since automatic thresholding methods
such as Otsu's might fail because of illumination variations and unexpected noise in
real-world applications. For this reason, Maldonado Bascón et al. (2010) trained an SVM
where the input is a 31 × 31 block of pixels in a gray-scale version of the pictograph. In
a more complicated approach, Baró et al. (2009) proposed an Error Correcting Output
Code framework for the classification of 31 traffic signs and compared their method
with various approaches.
Before 2011, there was no public and challenging dataset of traffic signs. Timofte et al.
(2011), Larsson and Felsberg (2011) and Stallkamp et al. (2012) introduced
three challenging datasets including annotations. These databases are called the Belgium
Traffic Sign Classification (BTSC), the Swedish Traffic Sign, and the German Traffic Sign
Recognition Benchmark (GTSRB), respectively. In particular, the GTSRB was used
in a competition and, as we will discuss shortly, the winner method classified 99.46%
of test images correctly (Stallkamp et al. 2012). Zaklouta et al. (2011) and Zaklouta
and Stanciulescu (2012, 2014) extracted Histogram of Oriented Gradient (HOG)
descriptors with three different configurations for representing the image and trained
a Random Forest and an SVM for classifying traffic signs on the GTSRB dataset.
Similarly, Greenhalgh and Mirmehdi (2012), Moiseev et al. (2013), Huang et al.
(2013), Mathias et al. (2013) and Sun et al. (2014) used the HOG descriptor. The
main difference between these works lies in the utilized classification model (e.g.,
SVM, Cascade SVM, Extreme Learning Machine, Nearest Neighbour, and LDA).
All of these works except Huang et al. (2013) use the traditional classification approach.


In contrast, Huang et al. (2013) utilize a two-level classification. In the first level, the
image is classified into one of the super-classes. Each super-class contains several traffic
signs with similar shape/color. Then, the perspective of the input image is adjusted
based on its super-class and another classification model is applied on the adjusted
image. The main problem of this method is sensitivity of the final classification to the
adjustment procedure. Timofte et al. (2011) proposed a framework for recognizing
the traffic signs in the BTSC dataset and achieved 97.04% accuracy on this
dataset.

5.2.3 Sparse Coding
Hsu and Huang (2001) coded each traffic sign using the Matching Pursuit algorithm.
During testing, the input image is projected onto different sets of filter bases to find the
best match. Lu et al. (2012) proposed a graph embedding approach for classifying
traffic signs. They preserved the sparse representation in the original space by using
the L1,2 norm. Liu et al. (2014) constructed the dictionary by applying k-means clustering to the training data. Then, each sample is coded using a novel coding scheme similar
to the Local Linear Coding approach (Wang et al. 2010). Recently, a method based on
visual attributes and Bayesian network was proposed in Aghdam et al. (2015). In
this method, we describe each traffic sign in terms of visual attributes. In order to
detect visual attributes, we divide the input image into several regions and code each
region using the Elastic Net Sparse Coding method. Finally, attributes are detected
using a Random Forest classifier. The detected attributes are further refined using a
Bayesian network.
Fleyeh and Davami (2011) projected the image into the principal component
space and found the class of the image by computing the Euclidean distance between the
projected image and the images in the database. Yuan et al. (2014) proposed a novel
feature extraction method to effectively combine color, global spatial structure, global
direction structure, and local shape information. Readers can refer to Møgelmose
et al. (2012) to study traditional approaches of traffic sign classification.

5.2.4 Discussion
Template matching approaches are not robust against perspective variations, aging,
noise, and occlusion. Hand-crafted features have a limited representation power and
they might not scale well as the number of classes increases. In addition, they are not
robust against irregular artifacts caused by motion blur and weather conditions.
This can be observed in the results reported in the GTSRB competition (Stallkamp
et al. 2012), where the best-performing solution based on hand-crafted features was only
able to correctly classify 97.88% of the test cases (see http://benchmark.ini.rub.de/). Later, Mathias et al. (2013) improved



the accuracy based on hand-crafted features up to 98.53% on the GTSRB dataset.
Notwithstanding, there are a few problems with this method. Their raw feature vector
is a 9000-dimensional vector constructed by applying five different methods. This
high-dimensional vector is later projected to a lower dimensional space. For this
reason, their method is time-consuming even when executed on a multi-core
CPU. Note that Table V in Mathias et al. (2013) has only reported the time of the
classifiers and has disregarded the time required for computing feature vectors and
projecting them into a lower dimensional space. Considering that the results in Table
V have been computed on the test set of the GTSRB dataset (12,630 samples),
classification of a feature vector alone takes 48 ms.

5.2.5 ConvNets
ConvNets were utilized by Sermanet and Lecun (2011) and Ciresan et al. (2012a)
in the field of traffic sign classification during the GTSRB competition where the
ConvNet of (Ciresan et al. 2012a) surpassed human performance and won the competition by correctly classifying 99.46% of test images. Moreover, the ConvNet of
(Sermanet and Lecun 2011) ended up in second place with a considerable margin over the third place, which was awarded to a method based on
the traditional classification approach. The classification accuracies of the runner-up
and the third place were 98.97 and 97.88%, respectively.
Ciresan et al. (2012a) construct an ensemble of 25 ConvNets, each consisting of
1,543,443 parameters. Sermanet and Lecun (2011) create a single network defined by
1,437,791 parameters. Furthermore, while the winning ConvNet uses the hyperbolic
tangent activation function, the runner-up ConvNet utilizes the rectified sigmoid as its activation function. Both methods suffer from a high number of arithmetic operations.
To be more specific, they use highly computational activation functions. To alleviate
these problems, Jin et al. (2014) proposed a new architecture including 1,162,284
parameters and utilizing the rectified linear unit (ReLU) activations (Krizhevsky
et al. 2012). In addition, there is a Local Response Normalization layer after each
activation layer. They built an ensemble of 20 ConvNets and classified 99.65% of test
images correctly. Although this architecture has fewer parameters than the two networks above, the ensemble is constructed from 20 ConvNets,
which is still not computationally efficient for real-world applications. It is worth mentioning that a ReLU layer and a Local Response Normalization layer together need
approximately the same number of arithmetic operations as a single hyperbolic tangent layer.
As a result, the run-time efficiency of the network proposed in Jin et al. (2014)
might be close to that of Ciresan et al. (2012a).
Recently, Zeng et al. (2015) trained a ConvNet to extract features of the image and
replaced the classification layer of their ConvNet with an Extreme Learning Machine
(ELM) and achieved 99.40% accuracy on the GTSRB dataset. There are two issues
with their approach. First, the output of the last convolution layer is a 200-dimensional
vector which is connected to 12,000 neurons in the ELM layer. This layer alone is
defined by 200 × 12,000 + 12,000 × 43 = 2,916,000 parameters, which makes it


impractical. Besides, it is not clear why their ConvNet reduces the dimension of the
feature vector from 250 × 16 = 4000 in Layer 7 to 200 in Layer 8 and then maps
this lower-dimensional vector to 12,000 dimensions in the ELM layer (Zeng et al.
2015, Table 1). One reason might be to cope with the calculation of the matrix inverse
during training of the ELM layer. Finally, since the input connections of the ELM
layer are determined randomly, it is probable that their ConvNets do not generalize
well on other datasets.

5.3 Preparing Dataset
In the rest of this book, we will design different ConvNets for classification of
traffic signs in the German Traffic Sign Recognition Benchmark (GTSRB) dataset
(Stallkamp et al. 2012). The dataset contains 43 classes of traffic signs. Images of
traffic signs are in RGB format and they are stored in the Portable Pixel Map (PPM)
format. Furthermore, each image contains only one traffic sign and the images vary from
15 × 15 to 250 × 250 pixels. The training set consists of 39,209 images and the test
set contains 12,630 images. Figure 5.2 shows one sample for each class from this
dataset.
The images of this dataset are collected in real-world conditions. They pose some
challenges such as blurry images, partially occluded signs, low resolution, and poor illumination.

Fig. 5.2 Sample images from the GTSRB dataset


The first thing to do with any dataset, including the GTSRB dataset, is to split it into a training set, a validation set, and a test set.
Fortunately, the GTSRB dataset comes with a separate test and training set. However,
it does not contain a validation set.

5.3.1 Splitting Data
Given any dataset, our first task is to divide it into one training set, one or more
validation sets, and one test set. In the case that the test set and training set are
drawn from the same distribution, we do not usually need more than one validation
set. Simply speaking, sets of images are drawn from the same distribution if they are
collected under the same conditions. The term condition here may refer to the camera
model, the pose of the camera with respect to the reference coordinate system, the geographical
location where the images are collected, illumination, etc. For example, if we collect
the training images during daylight and the test images at night, these two sets are not
drawn from the same distribution. As another example, if the training images are
collected in Spain and the test images are collected in Germany, it is likely that
the images are not drawn from the same distribution. If the training and test sets are not drawn
from the same distribution, we usually need more than one validation set to assess
our models.
However, for the sake of simplicity, we consider that the whole dataset is drawn from
the same distribution. Our task is to divide this dataset into the three sets that we
mentioned above. Before doing that, we have to decide the ratio of each set with
respect to the whole dataset. For example, one may split the dataset such that 80% of the
samples are assigned to the training set, 10% to the validation set, and 10% to the test set. Other
common choices are 60 − 20 − 20% and 70 − 15 − 15% for the training, validation, and
test sets, respectively.
The main idea behind splitting data into different sets is to evaluate whether or
not the trained model generalizes to unseen samples. We discussed this in detail
in Sect. 3.7. If the number of samples in the dataset is very high and they
are diverse, splitting the data with a ratio of 80 − 10 − 10% is a good choice. For example, one can take
100 photos of the same traffic sign with slight changes in camera pose. If
this process is repeated for 10 signs, the collected dataset will contain 1000 samples.
Even though the number of samples is high, the samples might not be diverse. When
the number of samples is very high and they are diverse, these samples adequately
cover the input space, so the chance of generalization increases. For this reason, we
might not need a lot of validation or test samples to assess how well the model
generalizes.
Notwithstanding, when the number of samples is low, a 60 − 20 − 20% split ratio
might be a better choice: with a small training set the model might overfit the training
data, which can dramatically reduce its generalization, and larger validation and test
sets make it possible to assess the model more accurately.


After deciding on the split ratio, we have to assign each sample in the dataset
to one and only one of these sets. Note that a sample cannot be assigned to more
than one set. Next, we explain three ways for splitting a dataset X into the disjoint sets
Xtrain , Xvalidation , and Xtest .

5.3.1.1 Random Sampling
In random sampling, samples are selected using a uniform distribution. Specifically,
all samples have the same probability of being assigned to one of the sets, without
replacement. This method is not deterministic meaning that if we run the algorithm
10 times, we will end up with 10 different training, validation, and test sets. The
easiest way to make this algorithm deterministic is to always seed the random number
generator with a constant value.
Implementing this method is trivial and its complexity is a linear function of the
number of samples in the original set. However, if |Xtrain| ≪ |X|, it is likely that
the training set does not cover the input space properly, so the model may learn
the training data accurately but not generalize well to the test samples.
Technically, this may lead to a model with high variance. Notwithstanding, random
sampling is a very popular approach and it works well in practice, especially when X
is large.
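As an illustration, the following minimal sketch (not from the original text) splits sample indices into the three sets; the ratios argument and function name are hypothetical, and the fixed seed makes the split deterministic.

import numpy as np

def random_split(n_samples, ratios=(0.8, 0.1, 0.1), seed=0):
    # Shuffle the sample indices once and cut them into training,
    # validation, and test parts according to the given ratios.
    rng = np.random.RandomState(seed)      # constant seed -> deterministic split
    perm = rng.permutation(n_samples)
    n_train = int(ratios[0] * n_samples)
    n_valid = int(ratios[1] * n_samples)
    return (perm[:n_train],
            perm[n_train:n_train + n_valid],
            perm[n_train + n_valid:])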

5.3.1.2 Cluster Based Sampling
In cluster based sampling, the input space is first partitioned into K clusters. The
partitioning can be done using common clustering methods such as k-means, c-means,
and hierarchical clustering. Then, for each cluster, some of the samples are assigned
to Xtrain , some of them are assigned to Xvalidation , and the rest are assigned to
Xtest . This method ensures that each of these three sets covers the whole space
represented by X . Assigning samples from each cluster to any of these sets can be
done using the uniform sampling approach. Again, the sampling has to be without
replacement.
The advantage of this method is that each set adequately covers the input space.
Nonetheless, this method might not be computationally tractable for large and high-dimensional datasets. This is due to the fact that clustering algorithms are iterative methods and applying them to large datasets may need considerable time to minimize their cost function.
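For illustration, a minimal sketch of this idea using OpenCV's k-means is given below (not from the original text); the number of clusters k, the split ratios, and the function name are hypothetical, and the samples are assumed to be flattened into feature vectors.

import cv2
import numpy as np

def cluster_split(X, k=10, ratios=(0.8, 0.1, 0.1), seed=0):
    # Partition the samples with k-means, then split every cluster separately
    # so that each set covers the whole space represented by X.
    rng = np.random.RandomState(seed)
    data = np.asarray(X, dtype='float32').reshape(len(X), -1)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 1e-4)
    _, labels, _ = cv2.kmeans(data, k, None, criteria, 5, cv2.KMEANS_PP_CENTERS)
    train, valid, test = [], [], []
    for c in range(k):
        idx = np.where(labels.ravel() == c)[0]
        rng.shuffle(idx)
        n_tr = int(ratios[0] * len(idx))
        n_va = int(ratios[1] * len(idx))
        train.extend(idx[:n_tr])
        valid.extend(idx[n_tr:n_tr + n_va])
        test.extend(idx[n_tr + n_va:])
    return train, valid, test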

5.3.1.3 DUPLEX Sampling
DUPLEX is a deterministic method that selects samples based on their mutual
Euclidean distance in the input space. The DUPLEX sampling algorithm is as
follows:

1   Input:
2       Set of samples X
3   Outputs:
4       Training set Xtrain
5       Validation set Xvalidation
6       Test set Xtest
7
8   Xtrain = ∅
9   Xvalidation = ∅
10  Xtest = ∅
11  FOR Xt ∈ {Xtrain, Xvalidation, Xtest} REPEAT
12      x1, x2 = argmax xi,xj ∈ X ‖xi − xj‖
13      Xt = Xt ∪ {x1, x2}
14      X = X − {x1, x2}
15  END FOR
16  WHILE X ≠ ∅ REPEAT
17      FOR Xt ∈ {Xtrain, Xvalidation, Xtest} REPEAT
18          IF |Xt| == n_t THEN
19              continue
20          END IF
21          x = argmax xi ∈ X  min xj ∈ Xt ‖xi − xj‖
22          Xt = Xt ∪ {x}
23          X = X − {x}
24      END FOR
25  END WHILE

Listing 5.1 DUPLEX algorithm

In the above algorithm, n_t denotes the maximum number of samples in set Xt. It is
computed based on the split ratio. First, the algorithm finds the two samples with the maximum Euclidean distance and assigns them to the training set. Then, these samples
are removed from the original set. This process is repeated for the validation and test
sets as well (Lines 11–15).
The second loop (Lines 16–25) is repeated until the original set is empty. At each
iteration, it finds the sample in X with the maximum distance from its closest sample
in Xt. This sample is added to Xt and removed from X. This procedure is repeated
for Xtrain, Xvalidation, and Xtest, respectively, at each iteration.
The DUPLEX algorithm guarantees that each of the sets will cover the input
space. However, this algorithm is not computationally efficient and it is not feasible to apply it to large and high-dimensional datasets.
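For illustration, the following is a minimal NumPy sketch of the algorithm (not from the original text, and not an optimized implementation); it computes pairwise distances explicitly, so it is only practical for small, low-dimensional sets, and it assumes that the three set sizes sum to the number of samples.

import numpy as np

def duplex_split(X, n_train, n_valid, n_test):
    # X: (N, D) array of samples; returns index lists for the three sets.
    X = np.asarray(X, dtype='float64')
    assert n_train + n_valid + n_test == len(X)
    remaining = list(range(len(X)))
    sets = {'train': [], 'valid': [], 'test': []}
    limits = {'train': n_train, 'valid': n_valid, 'test': n_test}

    # Lines 11-15: seed each set with the two mutually farthest remaining samples.
    for name in ('train', 'valid', 'test'):
        idx = np.array(remaining)
        d = np.linalg.norm(X[idx][:, None, :] - X[idx][None, :, :], axis=2)
        np.fill_diagonal(d, -1.0)
        i, j = np.unravel_index(np.argmax(d), d.shape)
        for k in (int(idx[i]), int(idx[j])):
            sets[name].append(k)
            remaining.remove(k)

    # Lines 16-25: repeatedly add the remaining sample that is farthest from
    # its closest member of the current set.
    while remaining:
        for name in ('train', 'valid', 'test'):
            if not remaining or len(sets[name]) >= limits[name]:
                continue
            rem = np.array(remaining)
            d = np.linalg.norm(X[rem][:, None, :] - X[sets[name]][None, :, :], axis=2)
            pick = int(rem[np.argmax(d.min(axis=1))])
            sets[name].append(pick)
            remaining.remove(pick)
    return sets['train'], sets['valid'], sets['test']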

5.3.1.4 Cross-Validation
In some cases, we may have a special set for testing our models. Alternatively, the
test set Xtest might be extracted from the original set X using one of the methods in
the previous section. Let X′ denote the set obtained by subtracting Xtest from X
(i.e., X′ = X − Xtest).
The aim of cross-validation is to split X′ into training and validation sets. In the
previous section, we mentioned how to divide X′ into only one training and one
validation set. This method is called hold-out cross-validation, where there
are only one training, one validation, and one test set.
Cross-validation techniques are applied to X′ rather than X (the test set is never
modified). If the number of samples in X′ is high, hold-out cross-validation might be


the first choice. However, one can use the random sampling technique to create
more than one training/validation pair. Then, training and validating the model is
done using each pair of training/validation sets separately and the average of the
evaluations is reported. It should be noted that training the model starts from
scratch with each training/validation pair. This method is called repeated random
sub-sampling cross-validation. It can be useful in practical applications
since it provides a better estimate of the generalization of the model. We encourage the
reader to study other cross-validation techniques such as K-fold cross-validation.
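As an illustration, a minimal index-based sketch of repeated random sub-sampling is given below (not from the original text); the number of rounds, the validation ratio, and the function name are hypothetical.

import numpy as np

def repeated_random_subsampling(n_samples, rounds=5, valid_ratio=0.2, seed=0):
    # Create several random training/validation index pairs from X'
    # (the set left after removing the test samples). The model is trained
    # from scratch on each pair and the validation scores are averaged.
    rng = np.random.RandomState(seed)
    n_valid = int(valid_ratio * n_samples)
    pairs = []
    for _ in range(rounds):
        perm = rng.permutation(n_samples)
        pairs.append((perm[n_valid:], perm[:n_valid]))   # (train idx, valid idx)
    return pairs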

5.3.2 Augmenting Dataset
Assume a training image xi ∈ R^{H×W×3}. To the human eye, the class of any sample
xj ∈ R^{H×W×3} where ‖xi − xj‖ < ε is exactly the same as that of xi. However, in the case
of ConvNets these slightly modified samples might be problematic (Szegedy et al.
2014b). Techniques for augmenting a dataset try to generate several xj for each
sample in the training set Xtrain. This way, the model will be adjusted
not only to each sample but also to its neighbors.
In the case of datasets composed of only images, this is analogous to slightly
modifying xi and generating xj in its close neighborhood. Augmenting a dataset is
important since it usually improves the accuracy of ConvNets and makes them
more robust to small changes in the input. We explain the reason using Fig. 5.3.
The middle image is the flipped version of the left image and the right image is
another sample from the training set. Denoting the left, middle, and right images by
xl, xm, and xr, respectively, their pairwise Euclidean distances are equal to:

‖xl − xm‖ = 25,012.5
‖xl − xr‖ = 27,639.4
‖xr − xm‖ = 26,316.0    (5.1)

Fig. 5.3 The image in the middle is the flipped version of the image on the left. The image on the
right is another sample from the dataset. The Euclidean distance from the left image to the middle image is
equal to 25,012.461 and the Euclidean distance from the left image to the right image is equal to
27,639.447


In other words, in terms of the Euclidean distance in image space (an R^{H×W×3} space),
these images are located at approximately similar distances from each other. We might
have expected ‖xl − xm‖ to be much smaller than ‖xl − xr‖. However, computing the pairwise Euclidean distances between these three samples reveals that this is
not always the case (note that Euclidean distances in high-dimensional spaces might be close even for far samples). Augmenting the training set with flipped images helps the
training set to cover the input space better and, this way, improves the accuracy.
It should be noted that great care must be taken when the dataset is
augmented by flipping images. The reason is that flipping an image may completely
change the class of the object. For example, flipping the image of a "danger: curve to the
right" sign will alter its meaning to "danger: curve to the left". There are many
other techniques for augmenting a training set with slightly modified samples. Next,
we explain some of these techniques.

5.3.2.1 Smoothing
Samples can be smoothed using blurring filters such as average filters or Gaussian
filters. Smoothing images mimics the out-of-focus effect in cameras. Augmenting
the dataset using this technique makes the model more tolerant to the blurring effect of
cameras. Smoothing an image using a Gaussian filter can be done simply using the
OpenCV library:
import cv2
import numpy as np

def smooth_gaussian(im, ks):
    sigma_x = (ks[1] // 2.) / 3.
    sigma_y = (ks[0] // 2.) / 3.
    return cv2.GaussianBlur(im, ksize=ks, sigmaX=sigma_x, sigmaY=sigma_y)

Concretely, an image can be smoothed using different kernel sizes. Bigger kernels
make the blurring effect stronger. It is worth mentioning that cv2.GaussianBlur returns
an image with the same size as its input by default; it internally manages the borders
of the image. Also, depending on how strongly you want to simulate the out-of-focus
effect, you may apply the above function to the same sample with different
kernel sizes, as shown below.
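For instance, a few out-of-focus variants of one sample might be generated as follows (the image path is hypothetical):

# Generate several blurred copies of one sample with increasing kernel sizes.
im = cv2.imread('/home/pc/GTSRB/sample.ppm')     # hypothetical path
blurred = [smooth_gaussian(im, (k, k)) for k in (3, 5, 7, 9)]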

5.3.2.2 Motion Blur
A camera mounted on a car is in fact a moving camera. Because of this, stationary
objects on the road appear as moving objects in the sequence of images. Depending on the
shutter speed, ISO speed, and speed of the car, images of these objects might
be degraded by the camera motion effect. Accurate simulation of this effect might not
be trivial. However, there is a simple approach for simulating the motion blur effect
using linear filters. Assume we want to simulate a linear motion where the camera


is moved along a line with orientation θ. To this end, a filter must be created where
all the elements of this filter are zero except those lying on the line with
orientation θ. These elements are set to 1 and, finally, the elements of the matrix
are normalized to ensure that the result of the convolution always stays within
the valid range of pixel intensities. This function can be implemented in Python as
follows:
def motion_blur(im, theta, ks):
    kernel = np.zeros((ks, ks), dtype='float32')

    half_len = ks // 2
    x = np.linspace(-half_len, half_len, 2*half_len + 1, dtype='int32')
    slope = np.tan(theta * np.pi / 180.)   # slope of a line with orientation theta (degrees)
    y = -np.round(slope * x).astype('int32')
    x += half_len
    y += half_len

    kernel[y, x] = 1.0
    kernel = np.divide(kernel, kernel.sum())
    im_res = cv2.filter2D(im, cv2.CV_8UC3, kernel)

    return im_res

Note that control statements such as "if the size of the filter is odd" or "if it is bigger
than a specific size" are removed from the above code. The cv2.filter2D function
handles the border effect internally by default and it returns an image with the same
size as its input. Motion filters might be applied to the same sample with different
orientations and sizes in order to simulate a wide range of motion blur effects.

5.3.2.3 Median Filtering
Median filters are edge-preserving filters which are used for smoothing images.
Basically, for each pixel of the image, all neighboring pixels in a small window are sorted
based on their intensity. Then, the value of the current pixel is replaced with the median of
the sorted intensities. This filtering approach can be implemented as follows:
def blur_median(im, ks):
    return cv2.medianBlur(im, ks)

The second parameter of this function is a scalar defining the size of the square window
around each pixel. In contrast to the previous smoothing methods, it is not common
to apply a median filter with large windows. The reason is that the result may not resemble
real images taken in real scenarios. However, depending on the resolution of the input
images, you may use median filtering with a 7 × 7 kernel. For low-resolution
images such as traffic signs, a 3 × 3 kernel usually produces realistic images, while
larger kernel sizes may not.

5.3.2.4 Sharpening
Contrary to smoothing, it is also possible to sharpen an image in order to make its
finer details stronger. Edges and noisy pixels are two examples of fine
details. In order to sharpen an image, a smoothed version of the image is subtracted
from the original image. This gives an image in which the fine details have
higher intensities. The sharpened image is obtained by adding this fine-detail image to
the original image. This can be implemented as follows:
def sharpen(im, ks=(3, 3), alpha=1):
    sigma_x = (ks[1] // 2.) / 3.
    sigma_y = (ks[0] // 2.) / 3.
    im = im.astype('float32') * 0.0039215          # scale intensities to [0, 1] (1/255)
    im_coarse = cv2.GaussianBlur(im, ks, sigmaX=sigma_x, sigmaY=sigma_y)
    im_fine = im - im_coarse
    im += alpha * im_fine
    return np.clip(im * 255, 0, 255).astype('uint8')

Here, the fine-detail image is added using a weight called α. Also, the size of the smoothing kernel affects the resulting sharpened image. This function can be applied with
different kernel sizes and values of α to a sample in order to generate different sharpened images. Figure 5.4 illustrates examples of applying the smoothing and sharpening
techniques with different configurations of parameters to an image from the GTSRB
dataset.

5.3.2.5 Random Crop
Another effective way for augmenting a dataset is to generate random crops for
each sample in the dataset. This may generate samples that are far from each other
in the input space but belong to the same object class. This is a desirable
property since it helps to cover some gaps in the input space. This method is already
implemented in Caffe using the parameter called crop_size (you can refer to the caffe.proto file in order to see how to use this parameter). However, in some cases
you may develop a special Python layer for your dataset or may want to store random
crops of each sample on disk. In these cases, the random cropping method can be
implemented as follows:
def crop(im, im_shape, rand):
    # pick a random top-left corner so that a crop of size im_shape fits inside im
    dy = rand.randint(0, im.shape[0] - im_shape[0])
    dx = rand.randint(0, im.shape[1] - im_shape[1])
    im_res = im[dy:im_shape[0] + dy, dx:im_shape[1] + dx, :].copy()
    return im_res

In the above code, the argument rand is an instance of a numpy.random.RandomState
object. You may also directly call the numpy.random.randint function instead. However,
passing a RandomState object makes it possible to seed the random number generator with
a desired value.

5.3.2.6 Saturation/Value Changes
Another technique for generating similar images at far distances is to manipulate
the saturation and value components of the image in the HSV color space.

Fig. 5.4 The original image at the top is modified using Gaussian filtering (first row), motion blur
(second and third rows), median filtering (fourth row), and sharpening (fifth row) with different
values of the parameters

This can be done by first transforming the image from the RGB space to the HSV space. Then, the
saturation and value components are manipulated. Finally, the manipulated image
is transformed back into the RGB space. Manipulating the saturation and value
components can be done in different ways. In the following code, we have changed
these components using a simple nonlinear approach:
def hsv_augment(im, scale, p, component):
    im_res = im.astype('float32') / 255.
    im_res = cv2.cvtColor(im_res, cv2.COLOR_BGR2HSV)
    im_res[:, :, component] = np.power(im_res[:, :, component] * scale, p)
    im_res = cv2.cvtColor(im_res, cv2.COLOR_HSV2BGR)
    return im_res


Setting the argument p to 1 will change the component linearly based on the value of
scale. These two arguments usually take a value within [0.5, 1.5]. A sample might
be modified using different combinations of these parameters. It is worth mentioning
that manipulating the hue component might not produce realistic results. The reason
is that it may change the color of the image and produce unrealistic images.
Similarly, you may manipulate the image in other color spaces such as the YUV color
space. The algorithm is similar to the above code. The only difference is to set the
second argument of the cvtColor function to the desired color space and manipulate the
correct component in that space.

5.3.2.7 Resizing
In order to simulate images taken from a distant object, we can resize a sample
with a scale factor less than one. This way, the size of the image will be reduced.
Likewise, a sample might be upscaled using interpolation techniques. Moreover, the
scale factor along each axis might be different but close to each other. The following
code shows how to implement this method using OpenCV in Python:
def resize(im, scale_x, scale_y, interpolation=cv2.INTER_NEAREST):
    im_res = cv2.resize(im, None, fx=scale_x, fy=scale_y, interpolation=interpolation)
    return im_res

Augmenting datasets with this technique is also a good practice. In particular, if the
number of low-resolution images is low, we can simply generate more of them by resizing
high-resolution images with a small scale factor.

5.3.2.8 Mirroring
Another effective way for augmenting datasets is to mirror images. This technique
is already implemented in the Caffe library using a parameter called mirror. It can
be also easily implemented as follows:
def flip(im):
    return im[:, -1::-1, :].copy()

Mirroring usually generates instances at far distances from the original sample.
However, as we mentioned earlier, great care must be taken when using this
technique. While flipping the "give way" or "priority road" signs does not change
their meaning, flipping a "mandatory turn left" sign will completely change its
meaning. Also, flipping a "speed limit 100" sign will generate an image without any
meaning from a traffic sign perspective. However, flipping images of objects such as
animals and foods is a totally valid approach.

5.3.2.9 Additive Noise
Adding noisy samples is beneficial for two reasons. First, it generates samples
of the same class at relatively far distances from each original sample. Second, it teaches


our model how to make correct predictions in the presence of noise. In general, given
an image x, the degraded image xnoisy can be obtained by adding a vector ν to the
original image (xnoisy = x + ν). Due to the addition operator used for degrading the
image, the vector ν is called additive noise. Note that the size of ν is identical to
the size of x.
Here, the key to degradation is generating the vector ν. Two common ways for
generating this vector are drawing random numbers from uniform or Gaussian
distributions. This can be implemented using the NumPy library in Python as follows:
def gaussian_noise(im, mu=0, sigma=1, rand_object=None):
    noise_mask = rand_object.normal(mu, sigma, im.shape)
    return cv2.add(im, noise_mask, dtype=cv2.CV_8UC3)

def uniform_noise(im, d_min, d_max, rand_object=None):
    noise_mask = rand_object.uniform(d_min, d_max, im.shape)
    return cv2.add(im, noise_mask, dtype=cv2.CV_8UC3)

In the above code, a separate noise vector is generated for each channel. The noisy
images generated with this approach might not be very realistic. Nonetheless, they
are useful for generating samples at relatively far distances. Alternatively, the noise vector can be
shared between the channels, which produces more realistic images.
def gaussian_noise_shared(im, mu=0, sigma=1, rand_object=None):
    noise_mask = rand_object.normal(mu, sigma, (im.shape[0], im.shape[1], 1))
    noise_mask = np.dstack((noise_mask, noise_mask, noise_mask))
    return cv2.add(im, noise_mask, dtype=cv2.CV_8UC3)

def uniform_noise_shared(im, d_min, d_max, rand_object=None):
    noise_mask = rand_object.uniform(d_min, d_max, (im.shape[0], im.shape[1], 1))
    noise_mask = np.dstack((noise_mask, noise_mask, noise_mask))
    return cv2.add(im, noise_mask, dtype=cv2.CV_8UC3)

The addition operator can also be implemented using the numpy.add function. In that case,
the type of the inputs must be selected appropriately to avoid the
overflow problem. Also, the outputs must be clipped within a valid range using the
numpy.clip function. The cv2.add function from OpenCV takes care of all these
conditions internally. Due to the random nature of this method, we can generate millions
of different noisy samples for each sample in the dataset.

5.3.2.10 Dropout
The last technique that we explain in this section is to generate noisy samples by
randomly zeroing some of the pixels in the image. This can be done in two different
ways. The first way is to connect a Dropout layer to the input layer in the network
definition. Alternatively, this can be done by generating a random binary mask using
the binomial distribution and multiplying the mask with the input image.
def dropout(im, p=0.2, rand_object=None):
    mask = rand_object.binomial(1, 1 - p, (im.shape[0], im.shape[1]))
    mask = np.dstack((mask, mask, mask))
    return np.multiply(im.astype('float32'), mask).astype('uint8')


Using the above implementation, all channels of the selected pixels are zeroed,
making them completely dark pixels. You may instead want to drop out channels of the selected
pixels randomly. In other words, instead of sharing the same mask between all channels, a separate mask can be generated for each channel. Figure 5.5 shows a few
examples of augmentations with different configurations applied to the sample from
the previous figure.

Fig. 5.5 Augmenting the sample in Fig. 5.4 using random cropping (first row), hue scaling (second
row), value scaling (third row), Gaussian noise (fourth row), Gaussian noise shared between channels
(fifth row), and dropout (sixth row) methods with different configurations of parameters


5.3.2.11 Other Techniques
The above methods are common techniques used for augmenting datasets. There are
many other methods that can be used for this purpose. Contrast stretching, histogram
equalization, contrast normalization, rotation, and shearing are some of them.
In general, depending on the application, you can
design new algorithms for synthesizing new images and augmenting datasets.

5.3.3 Static Versus On-the-Fly Augmenting
There are two ways for augmenting a dataset. In the first technique, all images in the
dataset are processed and new images are synthesized using the above methods. Then,
the synthesized images are stored on disk. Finally, a text file containing the paths of the
original images as well as the synthesized images, along with their class labels, is created
and passed to the ImageData layer in the network definition file. This method is called
static augmenting. Assume that 30 images are synthesized for each sample in the dataset.
Storing these images on disk will make the dataset, and hence its required disk space,
30 times larger.
Another method is to create a PythonLayer in the network definition file. This
layer connects to the database and loads the images into memory. Then, the loaded
images are transformed using the above methods and fed to the network. This method
of synthesizing is called on-the-fly augmenting. The advantage of this method is that
it does not require more space on disk. Also, when a new synthesizing method is added,
we do not need to reprocess the dataset and store the results on disk; the new method
is simply used in the PythonLayer to transform the loaded images. The
problem with this method is that it increases the size of the mini-batch considerably if
the synthesized images are directly concatenated to the mini-batch. To alleviate this
problem, we can always keep the size of the mini-batch constant by randomly picking
the synthesizing methods or by randomly selecting N images from the pool of original and
synthesized images in order to fill a mini-batch of size N.
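As an illustration, a minimal sketch of this idea is shown below (not from the original text); it reuses the augmentation functions defined earlier in this section, keeps the mini-batch size constant by applying one randomly picked transformation per image, and its parameter values are hypothetical.

def augment_on_the_fly(batch, rand):
    # batch: list of images loaded by the PythonLayer; rand: numpy RandomState.
    # Orientations are limited to +/-45 degrees so the motion-blur kernel
    # indices stay within range.
    methods = [lambda im: smooth_gaussian(im, (3, 3)),
               lambda im: motion_blur(im, rand.randint(-45, 46), 7),
               lambda im: gaussian_noise(im, 0, 4, rand),
               lambda im: dropout(im, 0.1, rand),
               lambda im: im]                        # leave some samples unchanged
    return [methods[rand.randint(0, len(methods))](im) for im in batch]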

5.3.4 Imbalanced Dataset
Assume that you are asked to develop a system for recognizing traffic signs. The first
task is to collect images of traffic signs. For this purpose, a camera can be attached
to a car and images of traffic signs can be stored while driving the car. Then, these
images are annotated and used for training classification models. Traffic signs such
as "speed limit 90" might be very frequent. In contrast, if images are collected from
a coastal part of Spain, the traffic sign "danger: snow ahead" might be very scarce.
Therefore, while the "speed limit 90" sign will appear frequently in the database, the
"danger: snow ahead" sign may only appear a few times in the dataset.
Technically, this dataset is imbalanced, meaning that the number of samples in each
class varies significantly. A classification model trained on an imbalanced dataset


is likely not to generalize well on the test set. The reason is that classes with more
samples contribute more to the loss than classes with fewer samples. In this case,
the model is very likely to learn to correctly classify the classes with more samples
so that the loss function is minimized. As a result, the model might not generalize
well on the classes with far fewer samples.
There are different techniques for partially solving this problem. The obvious
solution is to collect more data for classes with fewer samples. However, this might
be a very costly and impractical approach in terms of time and resources. There
are other approaches which are commonly used for training a model on imbalanced
datasets.

5.3.4.1 Upsampling
In this approach, samples of smaller classes are copied in order to match the number
of samples of the largest class. For example, if there are 10 samples in class 1 and 85
samples in class 2, samples of class 1 are copied so that there will be 85 samples in
class 1 as well. Copying samples can be done by random sampling with replacement.
It may also be done using a deterministic algorithm. This method is called upsampling
since it replicates samples in the minority class.
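For illustration, a minimal sketch of random upsampling with replacement is shown below (not from the original text); it returns sample indices, which could then be used to write an augmented list file, and the function name is hypothetical.

import numpy as np

def upsample_indices(labels, seed=0):
    # Replicate samples of the minority classes (with replacement) until every
    # class has as many samples as the largest class.
    rng = np.random.RandomState(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    indices = []
    for c, n in zip(classes, counts):
        idx = np.where(labels == c)[0]
        indices.extend(idx.tolist())
        if n < target:
            indices.extend(rng.choice(idx, target - n, replace=True).tolist())
    return indices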

5.3.4.2 Downsampling
Downsampling is the opposite of upsampling. In this method, instead of copying
samples from the minority class, some samples from the majority class are removed from
the dataset in order to match the number of samples in the minority class. Downsampling can
be done by randomly removing samples or by applying a deterministic algorithm. One
disadvantage of this method is that important information might be lost by
removing samples from the majority classes.

5.3.4.3 Hybrid Sampling
Hybrid sampling is a combination of the two aforementioned methods. Concretely,
some of the majority classes might be downsampled and some of the minority classes might
be upsampled such that they all have a common number of samples. This is a more
practical approach than just using one of the above sampling methods.

5.3.4.4 Weighted Loss Function
Another approach is to add a penalizing mechanism to the loss function such that a
sample from a minority class contributes more to the loss than a sample from a majority class.
In other words, assume that the error of a sample from the minority class
is equal to e1 and the error of a sample from the majority class is equal to e2. Because
the number of samples in the minority class is smaller, we want e1 to have more
impact on the loss function than e2. This can be simply done by incorporating


a specific weight for each class in the dataset and multiplying the error of each sample by
its corresponding weight. For example, assuming that the weight of the minority class
is w1 and the weight of the majority class is w2, the error terms of the above samples will
be equal to w1 × e1 and w2 × e2, respectively. Clearly, w1 > w2, so that the error
of one sample from the minority class contributes to the loss more than the error of
one sample from the majority class. Notwithstanding, because the number of samples
in the majority class is higher, the overall contributions of samples from the minority and
majority classes to the loss function will be approximately equal.
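For illustration, the following sketch (not from the original text) uses one common choice for the weights, inverse class frequency, and evaluates a weighted multiclass logistic loss with NumPy; the function names are hypothetical.

import numpy as np

def class_weights(labels):
    # Weight each class inversely proportional to its frequency so that
    # minority classes contribute more to the loss.
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    w = counts.sum() / (len(classes) * counts.astype('float64'))
    return dict(zip(classes, w))

def weighted_logistic_loss(scores, labels, weights):
    # scores: (N, C) class scores, labels: (N,) ground-truth class indices.
    scores = scores - scores.max(axis=1, keepdims=True)          # numerical stability
    log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    w = np.array([weights[l] for l in labels])
    return -(w * log_prob[np.arange(len(labels)), labels]).mean()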

5.3.4.5 Synthesizing Data
The last method that we describe in this section is synthesizing data. To be more
specific, minority classes can be balanced with majority classes by synthesizing data
for the minority classes. Any of the dataset augmentation methods might be used for
this purpose.

5.3.5 Preparing the GTSRB Dataset
In this book, the training set is augmented using some of the methods mentioned in
the previous sections. Then, 10% of the augmented training set is used for validation.
Also, the GTSRB dataset comes with a specific test set. The test set is not augmented
and it remains unchanged. Next, all samples in the training, validation and test sets
are cropped using the bounding box information provided in the dataset. Then, all
these samples are resized according to the input size of the ConvNets that we will explain
in the rest of this book.
Next, the mean image is obtained over the training set5 and it is subtracted from
each image in order to shift the training set to the origin. Also, the same transformation
is applied to the validation and test sets using the mean image learned from the
training set. A previous study (Coates and Ng 2012) suggests that subtracting the mean
image increases the performance of networks. Subtracting the mean is commonly done
on-the-fly by storing the mean image in a .binaryproto file and setting the mean_file
parameter in the network definition file.
Assume that the mean image is computed and stored in a Numpy matrix. This
matrix can be stored in a .binaryproto file by calling the following function:

5 Considering an RGB image as a three-dimensional matrix, the mean image is computed by adding all images in the training set in an element-wise fashion and dividing each element of the resulting matrix by the number of samples in the training set.


def write_mean_file(mean_npy, save_to):
    if mean_npy.ndim == 2:
        mean_npy = mean_npy[np.newaxis, np.newaxis, ...]
    else:
        mean_npy = np.transpose(mean_npy, (2, 0, 1))[np.newaxis, ...]

    binaryproto_file = open(save_to, 'wb')
    binaryproto_file.write(caffe.io.array_to_blobproto(mean_npy).SerializeToString())
    binaryproto_file.close()
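For completeness, a minimal sketch of computing the mean image over the training list file is shown below (not from the original text); the list-file format ("path label" per line), the image size, and the function name are assumptions.

import cv2
import numpy as np

def compute_mean_image(list_file, im_size=(32, 32)):
    # Accumulate all training images element-wise and divide by their number.
    mean_npy = np.zeros((im_size[1], im_size[0], 3), dtype='float64')
    n = 0
    with open(list_file, 'r') as fs:
        for line in fs:
            path = line.split()[0]
            im = cv2.resize(cv2.imread(path), im_size)
            mean_npy += im
            n += 1
    return (mean_npy / n).astype('float32')

# mean_npy = compute_mean_image('/home/pc/train.txt')
# write_mean_file(mean_npy, '/home/pc/gtsr_mean_32x32.binaryproto')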

The GTSRB dataset is an imbalanced dataset. In this book, we have applied the
upsampling technique to make this dataset balanced. Samples of the minority classes
are picked randomly for copying. Finally, separate text files containing the paths of
images and their corresponding class labels are created for the training, validation, and
test sets.

5.4 Analyzing Training/Validation Curves
Plotting the accuracy of the model on the training and validation sets at different iterations
during the training phase provides diagnostic information about the model. Figure 5.6
shows three different scenarios that might happen during training. First, there is
always an expected accuracy and we always try to achieve this accuracy. The plot on
the left shows an acceptable scenario where the accuracy of the model on both the training
and validation sets is close to the expected accuracy. In this case, the model can be
considered appropriate and it might be applied to the test set.
The middle plot indicates a scenario where the training and validation accuracies are close
to each other but both are far from the expected accuracy. In this case, we
can conclude that the current model suffers from high bias. The capacity of the
model can be increased by adding more neurons/layers to the model. Other solutions
for this scenario are explained in Sect. 3.7.

Fig. 5.6 The accuracy of the model on the training and validation sets tells us whether the model is
acceptable or whether it suffers from high bias or high variance


The right plot illustrates a scenario where the accuracy on the training set is very
close to the expected accuracy but the accuracy on the validation set is far from it.
This is a scenario where the model suffers from high variance. The quick
solution for this issue is to reduce the model capacity or to regularize it more. Other
solutions for this problem are explained in Sect. 3.7. Also, this scenario may happen
because the training and validation sets are not drawn from the same distribution.
It is always good practice to monitor the training and validation accuracies during
training. The vertical dashed line in the right plot shows the point in training where
the model has started to overfit the data. This is the point where the training procedure
can be stopped. This technique is called early stopping and it is an efficient way to
save training time and avoid overfitting.
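As an illustration, a simple early-stopping rule can be implemented as in the sketch below (not from the original text); the patience value and the function name are hypothetical.

def should_stop(valid_acc_history, patience=5):
    # Stop when the validation accuracy has not improved during the
    # last `patience` validation rounds.
    if len(valid_acc_history) <= patience:
        return False
    best_before = max(valid_acc_history[:-patience])
    return max(valid_acc_history[-patience:]) <= best_before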

5.5 ConvNets for Classification of Traffic Signs
In this section, we will explain different architectures for classification of traffic signs
on the GTSRB dataset.6 All architectures will be trained and validated on the same
training and validation sets. For this reason, they will share the same ImageData
layers for the training and validation ConvNets. These two layers are defined as
follows:
def gtsrb_source(input_size=32, mean_file=None):
    shared_args = {'is_color': True,
                   'shuffle': True,
                   'new_width': input_size,
                   'new_height': input_size,
                   'ntop': 2,
                   'transform_param': {'scale': 1. / 255}}
    if mean_file is not None:
        # optional mean image, subtracted by the data transformer
        shared_args['transform_param']['mean_file'] = mean_file
    L = caffe.layers
    net_v = caffe.NetSpec()
    net_v.data, net_v.label = L.ImageData(source='/home/pc/validation.txt',
                                          batch_size=200,
                                          include={'phase': caffe.TEST},
                                          **shared_args)
    net_t = caffe.net_spec.NetSpec()
    net_t.data, net_t.label = L.ImageData(source='/home/pc/train.txt',
                                          batch_size=48,
                                          include={'phase': caffe.TRAIN},
                                          **shared_args)
    return net_t, net_v

Later, this function will be used to define the architectures of the ConvNets. Since the
transformations applied to the training and validation samples are identical, we have
created a dictionary of shared parameters and passed it as keyword arguments after
unpacking it using the ** operator. Also, depending on the available memory of your
graphics card, you might need to reduce the batch size of the validation data to a
number smaller than 200. Finally, unless the above function is called with a different
value for input_size, the input images will be resized to 32 × 32 pixels by default.
6 Implementations of the methods in this chapter are available at https://github.com/pcnn.


Fig. 5.7 A ConvNet consisting of two convolution-hyperbolic activation-pooling blocks without fully
connected layers. Ignoring the activation layers, this network is composed of five layers

We start with a very small ConvNet. The architecture of this ConvNet is illustrated
in Fig. 5.7. This ConvNet has two blocks of convolution-activation-pooling layers
where the hyperbolic tangent function is used as the activation of the neurons. Without
counting the activation layers, the depth of this ConvNet is 5. Also, the width of the
ConvNet (the number of neurons in each layer) is not high. The last layer has 43 neurons,
one for each class in the GTSRB dataset.
This ConvNet is trained by minimizing the multiclass logistic loss function. In
order to generate the definition file of this network and other networks in this chapter
using Python, we can use the following auxiliary functions.
def conv_act(bottom, ks, nout, act='ReLU', stride=1, pad=0, group=1):
    L = caffe.layers
    c = L.Convolution(bottom, kernel_size=ks, num_output=nout,
                      stride=stride, pad=pad,
                      weight_filler={'type': 'xavier'},
                      bias_filler={'type': 'constant', 'value': 0},
                      param=[{'decay_mult': 1}, {'decay_mult': 0}],
                      group=group)
    r = eval('L.{}(c)'.format(act))
    return c, r

def conv_act_pool(bottom, conv_ks, nout, conv_st=1, conv_p=0, pool_ks=2, pool_st=2, act='ReLU', group=1):
    L = caffe.layers
    P = caffe.params
    c, r = conv_act(bottom, conv_ks, nout, act, conv_st, conv_p, group=group)
    p = L.Pooling(r,
                  kernel_size=pool_ks,
                  stride=pool_st,
                  pool=P.Pooling.MAX)
    return c, r, p

def fc_act_drop(bottom, nout, act='ReLU', drop_ratio=0.5):
    L = caffe.layers
    P = caffe.params
    fc = L.InnerProduct(bottom, num_output=nout,
                        weight_filler={'type': 'xavier'},
                        bias_filler={'type': 'constant', 'value': 0},
                        param=[{'decay_mult': 1}, {'decay_mult': 0}])
    r = eval('L.{}(fc)'.format(act))
    d = L.Dropout(r, dropout_ratio=drop_ratio)
    return fc, r, d

def fc(bottom, nout):
    L = caffe.layers
    return L.InnerProduct(bottom,
                          num_output=nout,
                          weight_filler={'type': 'xavier'},
                          bias_filler={'type': 'constant', 'value': 0})

The function conv_act creates a convolution layer and an activation layer and
connects the activation layer to the convolution layer. The function conv_act_pool
creates a convolution-activation layer and connects a pooling layer to the activation
layer. The function fc_act_drop creates a fully connected layer and attaches an activation layer to it. It also connects a dropout layer to the activation layer. Finally, the
function fc creates only a fully connected layer without an activation on top of it.
The following Python code creates the network shown in Fig. 5.7 using the above
functions:
def create_lightweight(save_to):
    L = caffe.layers
    P = caffe.params
    n_tr, n_val = gtsrb_source(input_size=32,
                               mean_file='/home/pc/gtsr_mean_32x32.binaryproto')
    n_tr.c1, n_tr.a1, n_tr.p1 = conv_act_pool(n_tr.data, 5, 6, act='TanH')
    n_tr.c2, n_tr.a2, n_tr.p2 = conv_act_pool(n_tr.p1, 5, 16, act='TanH')
    n_tr.f3_classifier = fc(n_tr.p2, 43)
    n_tr.loss = L.SoftmaxWithLoss(n_tr.f3_classifier, n_tr.label)
    n_tr.acc = L.Accuracy(n_tr.f3_classifier, n_tr.label)

    with open(save_to, 'w') as fs:
        s_proto = str(n_val.to_proto()) + '\n' + str(n_tr.to_proto())
        fs.write(s_proto)
        fs.flush()

The Accuracy layer computes the accuracy of the predictions on the current mini-batch. It accepts the actual labels of the samples in the mini-batch together with their predicted scores computed by the model and returns the fraction of samples that are
correctly classified. Assuming that the name of the solver definition file of this network
is solver_XXX.prototxt, we can run the following script to train and validate the
network and monitor the training/validation performance.
1   import caffe
2   import numpy as np
3   import matplotlib.pyplot as plt
4
5   solver = caffe.get_solver(root + 'solver_{}.prototxt'.format(net_name))
6
7   train_hist_len = 5
8   test_interval = 250
9   test_iter = 100
10  max_iter = 5000
11
12  fig = plt.figure(1, figsize=(16, 6), facecolor='w')
13  acc_hist_short = []
14  acc_hist_long = [0]*max_iter
15  acc_valid_long = [0]
16  acc_valid_long_x = [0]
17
18  for i in xrange(max_iter):
19      solver.step(1)
20
21      loss = solver.net.blobs['loss'].data.copy()
22
23      acc = solver.net.blobs['acc'].data.copy()
24      acc_hist_short.append(acc)
25      if len(acc_hist_short) > train_hist_len:
26          acc_hist_short.pop(0)
27      acc_hist_long[i] = (np.asarray(acc_hist_short)).mean()*100
28
29      if i > 0 and i % 10 == 0:
30          fig.clf()
31          ax = fig.add_subplot(111)
32          a3 = ax.plot([0, i], [100, 100], color='k', label='Expected')
33          a1 = ax.plot(acc_hist_long[:i], color='b', label='Training')
34          a2 = ax.plot(acc_valid_long_x, acc_valid_long, color='r', label='Validation')
35          plt.xlabel('iteration')
36          plt.ylabel('accuracy (%)')
37          plt.legend(loc='lower right')
38          plt.axis([0, i, 0, 105])
39          plt.draw()
40          plt.show(block=False)
41          plt.pause(0.005)
42
43      if i > 0 and i % test_interval == 0:
44          acc_valid = [0]*test_iter
45          net = solver.test_nets[0]
46          net.share_with(solver.net)
47          for j in xrange(test_iter):
48              net.forward()
49              acc = net.blobs['acc'].data.copy()
50              acc_valid[j] = acc
51          acc_valid_long.append(np.asarray(acc_valid).mean()*100)
52          acc_valid_long_x.append(i)
53          print 'Validation accuracy:', acc_valid_long[-1]

The above template can be used as a reference for training and validating
Caffe models in Python. Line 5 loads the information about the optimization algorithm as well as the training and test networks. Depending on the value of the field type
in the solver definition, this function returns a different instance. For example, if the value
of type is set to "SGD", it will return an instance of SGDSolver. Likewise, if it is
set to "RMSProp", it will return an instance of RMSPropSolver. All these objects
inherit from the same class and share the same methods. Hence, regardless
of the type of solver, the same step method (Line 19) is called in the above code to run the optimization
algorithm.
The accuracy layer in the network always returns the accuracy of the current mini-batch. In order to compute the accuracy over more than one mini-batch, we have to
compute the mean of the accuracies of these mini-batches. In the above algorithm, the
mean accuracy is computed over the last train_hist_len mini-batches.
To prevent Caffe from invoking the test network automatically, the field test_
interval in the solver definition file must be set to a very large number. Also, the
field test_iter in the solver definition can be set to an arbitrary number. Its value does not have any effect
on the optimization algorithm since the test interval is set to a large number
and the test phase will never be invoked by Caffe.
The variables in Lines 8–10 denote the test interval, the number of validation mini-batches
evaluated in each validation round, and the maximum number of iterations of our algorithm. Also, the
variables in Lines 13–16 keep the mean accuracies of the training samples and
validation samples.


Fig. 5.8 Training/validation curve of the network illustrated in Fig. 5.7

The optimization loop starts in Line 18 and it is repeated max_iter times. The first
line in the loop runs the forward and backward steps on one mini-batch from the
training set. Then, the loss of the network on the current mini-batch is obtained in Line
21. Similarly, the accuracy of the network on the current mini-batch is obtained in Line
23. Lines 24–27 store the accuracy of the last train_hist_len mini-batches and update the
mean training accuracy for the current iteration.
Lines 29–41 draw the training, validation, and expected curves every 10 iterations.
Most of the time, it is good practice to visually inspect these curves in order to stop
the algorithm early if necessary (if the number of iterations is high, the above code
should be changed slightly in order to always plot a fixed number of points). Lines 43–53
validate the network every test_interval iterations using the validation set. Each time,
the mean accuracy is computed over all mini-batches of the validation set.
Figure 5.8 shows the training/validation curve of the network in Fig. 5.7. According to the plot, the validation accuracy plateaus after 1000 iterations. Besides, the
training accuracy does not improve afterwards. In the case of the traffic sign classification problem, it is expected to achieve 100% accuracy. Nonetheless, the training
and validation accuracies are much lower than the expected accuracy. The main reason is
that the network in Fig. 5.7 has a very limited capacity: the depth and width of the
network are low. Specifically, the number of neurons in each layer is very low. Also,
the depth of the network could be increased by adding more convolution-activation-pooling
blocks and fully connected layers.
There was a competition for classification of traffic signs on the GTSRB dataset.
The network in Ciresan et al. (2012a) won the competition and surpassed human
accuracy on this dataset. The architecture of this network is illustrated in Fig. 5.9.



Fig. 5.9 Architecture of the network that won the GTSRB competition (Ciresan et al. 2012a)

In order to be able to add one more convolution-activation-pooling block to the
network in Fig. 5.7, the size of the input image must be increased so that the spatial size
of the feature maps after the second convolution-activation-pooling block is big enough
to apply another convolution-activation-pooling block to these feature maps. For this
reason, the size of the input is increased from 32 × 32 pixels in Fig. 5.7 to 48 × 48
pixels in Fig. 5.9.
Also, the first layer has a bigger receptive field and 100 filters rather than the 6
filters of the previous network. The second block has 150 filters of size 4 × 4, which
yields a feature map of size 150 × 9 × 9. The third convolution-activation-pooling
block consists of 250 filters of size 4 × 4. The output of this block is a 250 × 3 × 3
feature map.
Another improvement of this network is utilizing a fully connected layer between
the last pooling layer and the classification layer. Specifically, there is a fully connected
layer with 300 neurons where each neuron is connected to the 250 × 3 × 3 neurons of
the previous layer. This network can be defined in Python as follows:
def create_net_jurgen(save_to):
    L = caffe.layers
    P = caffe.params
    net, net_valid = gtsrb_source(input_size=48,
                                  mean_file='/home/hamed/Desktop/GTSRB/Training_CNN/gtsr_mean_48x48.binaryproto')
    net.conv1, net.act1, net.pool1 = conv_act_pool(net.data, 7, 100, act='TanH')
    net.conv2, net.act2, net.pool2 = conv_act_pool(net.pool1, 4, 150, act='TanH')
    net.conv3, net.act3, net.pool3 = conv_act_pool(net.pool2, 4, 250, act='TanH')
    net.fc1, net.fc_act, net.drop1 = fc_act_drop(net.pool3, 300, act='TanH')
    net.f3_classifier = fc(net.drop1, 43)
    net.loss = L.SoftmaxWithLoss(net.f3_classifier, net.label)
    net.acc = L.Accuracy(net.f3_classifier, net.label)

    with open(save_to, 'w') as fs:
        s_proto = str(net_valid.to_proto()) + '\n' + str(net.to_proto())
        fs.write(s_proto)
        fs.flush()
        print s_proto

After creating the network, a solver must be created for it. Then, it
can be trained and validated using the script we mentioned earlier by loading the
appropriate solver definition file. Figure 5.10 shows the training/validation curve of
this network.
As the training/validation curve shows, the above architecture is appropriate for classification of traffic signs in the GTSRB dataset. According to the curve,
the training error gets close to zero, and if the network is trained longer, the training error might reduce further. In addition, the validation accuracy is ascending, and
with a longer optimization the accuracy is likely to improve as well.
The size of the receptive field and the number of filters are chosen properly in all layers
of the above network. Also, the flexibility (nonlinearity) of the network is enough for
modeling a wide range of traffic signs. However, the number of parameters in this


Fig. 5.10 Training/validation curve of the network illustrated in Fig. 5.9

network could be reduced in order to make it computationally and memory-wise
more efficient.
The above network utilizes the hyperbolic tangent function to compute neuron
activations. The hyperbolic tangent is defined as tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)) = (e^(2x) − 1) / (e^(2x) + 1). Even
with an efficient implementation of the exponentiation e^x, it still requires many multiplications. Note that x is a floating-point number since it is the weighted sum of the
inputs to the neuron. For this reason, e^x cannot be implemented using a lookup table.
An efficient way to calculate e^x is as follows: First, write x = x_int + r, where x_int
is the nearest integer to x and r ∈ [−0.5, 0.5], which gives e^x = e^(x_int) × e^r. Second,
multiply e by itself x_int times. This multiplication can be done quite efficiently.
To further increase the efficiency, various integer powers of e can be precomputed and
stored in a lookup table. Finally, e^r can be estimated using the polynomial
e^r ≈ 1 + r + r^2/2 + r^3/6 + r^4/24 + r^5/120 with an estimation error of about 3e−5. Consequently, calculating
tanh(x) needs [x] + 5 multiplications and 5 divisions. Assuming that a division
and a multiplication need the same number of CPU cycles, tanh(x) can
be computed using [x] + 10 multiplications. The simplest scenario is when x ∈
[−0.5, 0.5]; then, tanh(x) can be calculated using 10 multiplications. Based on
this, the total number of multiplications of the network proposed in Ciresan et al.
(2012a) is equal to 128,321,700. Since they build an ensemble of 25 networks,
the total number of multiplications must be multiplied by 25, which equals
3,208,042,500 multiplications for making a prediction using the ensemble of 25
networks shown in Fig. 5.9.
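The exponentiation scheme described above can be sketched as follows; the lookup-table range and the use of plain Python floats are illustrative assumptions only.

    # Split x into its nearest integer and a remainder in [-0.5, 0.5], read
    # e^x_int from a small lookup table and approximate e^r with a 5th-order
    # polynomial (error on the order of 3e-5 on [-0.5, 0.5]).
    import math

    E_TABLE = {k: math.e ** k for k in range(-20, 21)}   # integer powers of e

    def exp_approx(x):
        x_int = int(round(x))
        r = x - x_int
        e_r = 1 + r + r**2 / 2 + r**3 / 6 + r**4 / 24 + r**5 / 120
        return E_TABLE[x_int] * e_r

    def tanh_approx(x):
        e2x = exp_approx(2 * x)
        return (e2x - 1) / (e2x + 1)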
Aghdam et al. (2016a) aimed to reduce the number of parameters together with
the number of arithmetic operations and to increase the classification accuracy. To
this end, they replaced the hyperbolic tangent nonlinearities with Leaky ReLU activation
functions. Besides the favorable properties of ReLU activations, they are also computationally very efficient. To be more specific, a Leaky ReLU function needs only one
multiplication in the worst case, and if the input of the activation function is positive,
it does not need any multiplication. Based on this idea, they designed the network
illustrated in Fig. 5.11.
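For comparison, a Leaky ReLU needs at most one multiplication per activation, as the following minimal sketch shows (the slope 0.01 is only an illustrative value):

    # No multiplication when the input is positive, one (by the slope) otherwise.
    def leaky_relu(x, negative_slope=0.01):
        return x if x > 0 else negative_slope * x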


Fig. 5.11 Architecture of network in Aghdam et al. (2016a) along with visualization of the first
fully connected layer as well as the last two pooling layers using the t-SNE method. Light blue,
green, yellow and dark blue shapes indicate convolution, activation, pooling, and fully connected
layers, respectively. In addition, each purple shape shows a linear transformation function. Each
class is shown with a unique color in the scatter plots


Furthermore, the two middle convolution-pooling layers of the previous network
are each divided into two separate groups. There is also a layer connected to the input
which applies a linear transformation to each channel separately. Overall, this network
consists of a transformation layer, three convolution-pooling layers and two fully
connected layers with a dropout layer (Hinton 2014) in between. Finally, there is a
Leaky ReLU layer after each convolution layer and after the first fully connected
layer. The network accepts a 48 × 48 RGB image and classifies it into one of the 43
traffic sign classes of the GTSRB dataset.
It is worth mentioning that the number of parameters is reduced by dividing
the two middle convolution-pooling layers into two groups. More specifically, the
transformation layer applies an element-wise linear transformation f_c(x) = a_c x + b_c
to the cth channel of the input image, where a_c and b_c are trainable parameters and
x is the input value. Note that each channel has a unique transformation function.
Next, the image is processed using 100 filters of size 7 × 7. The notation C(c, k, w)
indicates a convolution layer with k filters of size w × w applied to an input with c
channels. Then, the output of the convolution is passed through a Leaky ReLU layer
and fed into the pooling layer, where a MAX-pooling operation is applied on a
3 × 3 window with stride 2.
In general, a C(c, k, w) layer contains c × k × w × w parameters. In fact, the
second convolution layer accepts a 100-channel input and applies 150 filters of
size 4 × 4. Using this configuration, the number of parameters in the second
convolution layer would be 240,000. The number of parameters in the second
layer is halved by dividing the input channels into two equal parts and feeding
each part into one of two separate C(50, 75, 4) convolution-pooling units.
Similarly, the third convolution-pooling layer halves the number of parameters
by using two C(75, 125, 4) units instead of one C(150, 250, 4) unit. This architecture
is collectively parametrized by 1,123,449 weights and biases, which is a 27, 22 and 3%
reduction in the number of parameters compared with the networks proposed in
Ciresan et al. (2012a), Sermanet and Lecun (2011), and Jin et al. (2014), respectively.
Compared with Jin et al. (2014), this network also needs fewer arithmetic operations since
Jin et al. (2014) use a Local Response Normalization layer after each activation
layer, which needs a few multiplications per element of the resulting feature map
from the previous layer. The following script shows how to implement this network in
Python using the Caffe library:
Python using the Caffe library:
def create_net_ircv1(save_to):
    L = caffe.layers
    P = caffe.params
    net, net_valid = gtsrb_source(input_size=48,
                                  mean_file='/home/pc/gtsr_mean_48x48.binaryproto')
    net.tran = L.Convolution(net.data,
                             num_output=3,
                             group=3,
                             kernel_size=1,
                             weight_filler={'type': 'constant', 'value': 1},
                             bias_filler={'type': 'constant', 'value': 0},
                             param=[{'decay_mult': 1}, {'decay_mult': 0}])
    net.conv1, net.act1, net.pool1 = conv_act_pool(net.tran, 7, 100, act='ReLU')
    net.conv2, net.act2, net.pool2 = conv_act_pool(net.pool1, 4, 150, act='ReLU', group=2)
    # the text specifies a Leaky ReLU after every convolution layer
    net.conv3, net.act3, net.pool3 = conv_act_pool(net.pool2, 4, 250, act='ReLU', group=2)
    net.fc1, net.fc_act, net.drop1 = fc_act_drop(net.pool3, 300, act='ReLU')
    net.f3_classifier = fc(net.drop1, 43)
    net.loss = L.SoftmaxWithLoss(net.f3_classifier, net.label)
    net.acc = L.Accuracy(net.f3_classifier, net.label)

    with open(save_to, 'w') as fs:
        s_proto = str(net_valid.to_proto()) + '\n' + str(net.to_proto())
        fs.write(s_proto)
        fs.flush()
        print s_proto
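As a quick sanity check of the parameter counts discussed above, the grouped middle layers can be compared with their ungrouped counterparts using the C(c, k, w) = c × k × w × w rule; the small script below is only an illustration (biases are ignored).

    def conv_params(c, k, w):
        return c * k * w * w

    full_conv2    = conv_params(100, 150, 4)      # 240,000
    grouped_conv2 = 2 * conv_params(50, 75, 4)    # 120,000
    full_conv3    = conv_params(150, 250, 4)      # 600,000
    grouped_conv3 = 2 * conv_params(75, 125, 4)   # 300,000
    print(full_conv2, grouped_conv2, full_conv3, grouped_conv3)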

The above network is trained using the same training/validation procedure.
Figure 5.12 shows the training/validation curve of this network. Although the number
of parameters is reduced compared with Fig. 5.9, the network still accurately classifies the traffic signs in the GTSRB dataset. In general, after finding a model which
produces results very close to the expected accuracy, it is often possible to reduce the
model complexity while keeping the accuracy unaffected.
It is a common practice to inspect the features extracted by a ConvNet using a visualization technique. The general procedure is to first train the network. Then, some
samples are fed to the network and the feature vectors extracted by a specific layer on all
samples are collected. Assuming that each feature vector is D-dimensional,
a feature generated by this layer will be a point in the D-dimensional space. These
D-dimensional samples can be embedded into a two-dimensional space.
Embedding into a two-dimensional space can be done using principal component
analysis, self-organizing maps, isomaps, locally linear embedding, etc. One of
the embedding methods which produces promising results is called t-distributed
stochastic neighbor embedding (t-SNE) (Maaten and Hinton 2008). It nonlinearly
embeds points into a lower dimensional (in particular, two- or three-dimensional) space
by preserving the structure of neighborhoods as much as possible. This is an important property since it adequately reflects the neighborhood structure in the high-dimensional
space.

Fig. 5.12 Training/validation curve on the network illustrated in Fig. 5.11


Fig. 5.13 Compact version of the network illustrated in Fig. 5.11 after dropping the first fully
connected layer and the subsequent Leaky ReLU layer

Figure 5.11 shows the two-dimensional embedding of the feature maps after the second and third pooling layers as well as the first fully connected layer. The embedding
is done using the t-SNE method. Samples of each class are represented using a
different color. According to the embedding result, the classes are separated properly after the third pooling layer. This implies that the classification layer might be
able to accurately discriminate the classes even if we omit the first fully connected layer.
According to the t-SNE visualization, the first fully connected layer does not considerably increase
the discrimination of the classes. Instead, it rearranges the classes in a
lower dimensional space and it might mainly affect the interclass distribution of the
samples.
Consequently, it is possible to discard the first fully connected layer and the
subsequent Leaky ReLU layer from the network and connect the third pooling layer
directly to the dropout layer. The more compact network is shown in Fig. 5.13.
From an optimization perspective, this decreases the number of parameters from
1,123,449 to 531,999, which is a 65, 63, and 54% reduction compared with Ciresan
et al. (2012a), Sermanet and Lecun (2011), and Jin et al. (2014), respectively.

5.6 Ensemble of ConvNets
Given a training set X = {x_1, ..., x_N}, we denote a model trained on this set by
M. Assuming that M is a model for classifying the input x_i into one of K classes,
M(x) ∈ R^K returns the per-class scores (the output of the classification layer without applying the softmax function) of the model for the input x. The main idea behind ensemble


learning is to train L models M_1, ..., M_L on X and predict the class of sample x_i
by combining the models using

Z(M_1(x_i), ..., M_L(x_i)).    (5.2)

In this equation, Z is a function which accepts the classification scores predicted for sample
x_i by the L models and combines these scores in order to classify x_i. Previous
studies (Ciresan et al. 2012a; Jin et al. 2014; Sermanet et al. 2013; Aghdam et al.
2016a) show that creating an ensemble of ConvNets might increase the classification
accuracy. In order to create an ensemble, we have to answer two questions. First,
how to combine the predictions made by the models on sample x_i? Second, how to
train L different models on X?

5.6.1 Combining Models
The first step in creating an ensemble is to design a method for combining the classification
scores predicted by different models on the same sample. In this section, we
explain some of these methods.

5.6.1.1 Voting
In this approach, first, a class label is predicted by each model. This way, each
model votes for its class label. Then, the function Z returns the class by combining
all votes. One method for combining votes is to count the votes for each class and
return the class with the majority of votes. This technique is called majority voting. We
may add more restrictions to this algorithm. For example, if the class with the majority
of votes does not have a minimum number of votes, the algorithm may return a
value indicating that the sample cannot be classified with high confidence.
In the above approach, all models have the same impact in voting. It is also possible
to assign a weight to each model so that its votes are counted according to this
weight. This method is called weighted voting. For example, if the weight of a model
is equal to 3, its vote is counted three times. Then, the majority of votes is determined taking
into account the weights of the models. This technique is not widely used with neural
networks.
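A minimal sketch of (weighted) majority voting over predicted labels is given below; the optional minimum-vote threshold corresponds to the restriction mentioned above, and all names are illustrative.

    from collections import Counter

    def majority_vote(predictions, weights=None, min_votes=0):
        """predictions: list of class labels, one per model."""
        weights = weights or [1] * len(predictions)
        counts = Counter()
        for label, w in zip(predictions, weights):
            counts[label] += w
        label, votes = counts.most_common(1)[0]
        return label if votes >= min_votes else None   # None: low confidence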

5.6.1.2 Model Averaging
Model averaging is commonly used for creating ensembles of neural networks. In
this approach, the function Z is defined as

Z(M_1(x_i), ..., M_L(x_i)) = Σ_{j=1}^{L} α_j M_j(x_i)    (5.3)


where α_j is the weight of the jth model. If we set all α_j = 1/L, j = 1, ..., L, the above function simply computes the average of the classification scores. If each model is assigned
a different weight, the above function computes the weighted average of the
classification scores. Finally, the predicted class of sample x_i is the index of the maximum
value in Z.
ConvNets can attain low bias and high variance by increasing the number of layers
and the number of neurons in each layer. This means that the model has a higher chance
of overfitting the training data. However, the core idea behind model averaging is that
the average of many models with low bias and high variance behaves as a
model with low bias and low variance on the data, which in turn increases the accuracy
and generalization of the ensemble.
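A minimal sketch of Eq. (5.3) is shown below: the score vectors of the L models are averaged (or weighted-averaged) and the index of the maximum combined score is returned; the array shapes are assumptions for illustration.

    import numpy as np

    def average_models(scores, alphas=None):
        """scores: array of shape (L, K) with one score vector per model."""
        scores = np.asarray(scores)
        L = scores.shape[0]
        alphas = np.full(L, 1.0 / L) if alphas is None else np.asarray(alphas)
        combined = (alphas[:, None] * scores).sum(axis=0)
        return int(np.argmax(combined)), combined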

5.6.1.3 Stacking
Stacking is a generalized version of model averaging. In this method, Z is another
function that learns how to combine the classification scores predicted by the L models.
This function can be a linear function such as weighted averaging, or it can be a
nonlinear function such as a neural network.
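A minimal sketch of stacking is shown below: a simple classifier is trained on the concatenated score vectors of the L models to play the role of Z. Logistic regression is used here purely as an example of such a combiner and is not prescribed by the text.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_stacker(score_list, labels):
        """score_list: list of (n_samples, K) score matrices, one per model."""
        Z_features = np.concatenate(score_list, axis=1)   # (n_samples, L*K)
        stacker = LogisticRegression(max_iter=1000)
        return stacker.fit(Z_features, labels)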

5.6.2 Training Different Models
The second step in creating an ensemble is to train L different models. The easiest
way to achieve this goal is to sample the same model during a single training run
but at different iterations. For example, we can save the weights at the 1,000th, 5,000th,
15,000th, and 40,000th iterations in order to create four different models. Another
way is to initialize the same model L times and execute the training procedure L
times in order to obtain L models with different initializations. This method is more
common than the former one. A more general setting is to design L networks with
different architectures and train them on the training set. The previous two methods
can be formulated as special cases of this method.

5.6.2.1 Bagging and Boosting
Bagging is a technique that can be used for training different models. In this technique,
L random subsets of the training set X are generated. Then, a model is trained on each
subset independently. Clearly, some samples might appear in more than one subset.
Boosting starts by assigning equal weights to all samples in the training set.
Then, a model is trained taking into account the weight of each sample. After that,
the samples in the training set are classified using this model. The weights of correctly
classified samples are reduced and the weights of incorrectly classified samples are
increased. Then, a second model is trained on the training set using the new weights.
This procedure is repeated L times, yielding L different models. Boosting is not
common with neural networks and it is mainly used for creating ensembles of
weak classifiers such as decision stumps.
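A minimal sketch of generating L bootstrap subsets for bagging is shown below; the subset size and sampling with replacement are illustrative choices.

    import numpy as np

    def bagging_subsets(num_samples, L, subset_size=None, seed=0):
        """Return L index arrays sampled with replacement, so a sample
        can appear in several subsets."""
        rng = np.random.RandomState(seed)
        subset_size = subset_size or num_samples
        return [rng.choice(num_samples, size=subset_size, replace=True)
                for _ in range(L)]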


5.6.3 Creating Ensemble
The works in Ciresan et al. (2012a), Jin et al. (2014) utilize the model averaging technique
in which the average score of several ConvNets is computed. However, there are two
limitations to this approach. First, sometimes the classification accuracy of an
ensemble might not improve substantially compared with a single ConvNet in
the ensemble. This is due to the fact that these ConvNets might have ended up in
the same local minimum during the training process. As a result, their scores are
very similar and their combination does not change the posterior of the classes. The
second problem is that there might be cases where adding a new ConvNet to
the ensemble reduces the classification performance. One possibility is that the new
ConvNet's belief about the class posteriors is greatly different from the belief of the
ensemble. Consequently, the new ConvNet changes the posterior of the ensemble
dramatically, which in turn reduces the classification performance.
To alleviate this problem, Ciresan et al. (2012a) and Jin et al. (2014) create ensembles consisting of many ConvNets. The idea is that the number of ConvNets which
contradict the belief of the ensemble is smaller than the number of ConvNets which
increase the classification performance. Therefore, the overall performance of the
ensemble increases as we add more ConvNets.
While the idea is generally correct, it poses a serious problem in practical
applications. Concretely, an ensemble with many ConvNets needs more time to
classify the input image. One solution to this problem is to formulate the ensemble
construction as a LASSO regression (Tibshirani 1994) problem. Formally, given the
classification score vector L_i^j of the ith ConvNet computed on the jth image, our goal is
to find coefficients a_i by minimizing the following error function:

E = Σ_{j=1}^{M} ‖ y_j − Σ_{i=1}^{N} a_i L_i^j ‖ + λ Σ_{i=1}^{N} |a_i|    (5.4)
where M is the total number of test images, N is the number of ConvNets in
the ensemble, and λ is a user-defined value that determines the amount of sparseness. It
is well studied that L_1-norm regularization produces a sparse vector in which most of
the coefficients a_i are zero. Thus, the ConvNets corresponding to these coefficients
can be omitted from the ensemble. The remaining ConvNets are linearly combined
according to their corresponding a_i values. Determining the correct value for λ is an
empirical task and it might need many trials. More specifically, small values of λ
retain most of the ConvNets in the ensemble. Conversely, increasing the value of λ
drops more ConvNets from the ensemble.
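One way to approximate the LASSO formulation (5.4) with scikit-learn's Lasso is sketched below. Here each (image, class) pair is treated as one regression sample so that a single coefficient a_i is learned per ConvNet; this reshaping, the variable names and the use of scikit-learn are assumptions, not the formulation used in the original work.

    import numpy as np
    from sklearn.linear_model import Lasso

    def select_by_lasso(scores, labels, num_classes, lam=1e-3):
        """scores: array (N, M, K) of class scores of N ConvNets on M images.
        Returns one coefficient per ConvNet; near-zero ones can be dropped."""
        N, M, K = scores.shape
        X = scores.transpose(1, 2, 0).reshape(M * K, N)   # one row per (image, class)
        y = np.eye(num_classes)[labels].reshape(M * K)    # one-hot targets, flattened
        model = Lasso(alpha=lam).fit(X, y)
        return model.coef_                                # a_i, mostly zero for large lam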
Another method is to formulate the ensemble construction as an optimal subset
selection problem by solving the following optimization problem (Aghdam et al.
2016a, b):

arg max_{I ⊂ {1,...,N}}  [ (1/M) Σ_{j=1}^{M} δ( y_j − arg max Σ_{i∈I} L_i^j ) − λ|I| ]    (5.5)

where the arg max function returns the index of the maximum value in the classification score vector L_i^j, and y_j is the actual class. The first term calculates the
classification accuracy of the selected subset of ConvNets over the testing dataset
and the second term penalizes it based on the cardinality of
the set I. In other words, we are looking for a subset of the N ConvNets whose classification
accuracy is high and which contains as few ConvNets as possible. In contrast
to the LASSO formulation, selecting the value of λ is straightforward. For example,
assume two subsets I_1 and I_2 including 4 and 5 ConvNets, respectively. Moreover,
consider that the classification accuracy of I_1 is 0.9888 and the classification accuracy of I_2 is 0.9890. If we set λ = 3e−4 and calculate the score using (5.5), their
scores will be equal to 0.9876 and 0.9875, respectively. Thus, despite its higher accuracy, the subset
I_2 is not better than the subset I_1 because adding an extra ConvNet to the ensemble
improves the accuracy by only 0.02%, which is outweighed by the penalty term. However, if we
choose λ = 2e−4, the subset I_2 will have a higher score than the subset
I_1. In sum, λ specifies the minimum accuracy gain that a single
ConvNet must bring to the ensemble. The above objective function can be optimized
using an evolutionary algorithm such as a genetic algorithm.
In Aghdam et al. (2016b), a genetic algorithm with a population of 50 chromosomes
is used for finding the optimal subset. Each chromosome in this method is encoded
using the N-bit binary coding scheme. A gene with value 1 indicates the selection
of the corresponding ConvNet for the ensemble. The fitness of each chromosome is
computed by applying (5.5) to the validation set. The offspring are selected using the
tournament selection operator with tournament size 3. The crossover operators are
single-point, two-point, and uniform, one of which is randomly applied in
each iteration. The mutation operator flips each gene of a chromosome with probability
p = 0.05. Finally, elitism (with elite count 1) is applied to guarantee that
the algorithm does not forget the best answer. This can also contribute to faster
convergence, since the best individual found so far takes part in the selection process of
the next iteration, which may generate better answers during the optimization.
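A minimal sketch of how the fitness in (5.5) could be evaluated for a binary chromosome is shown below; the array shapes and variable names are assumptions for illustration.

    import numpy as np

    def fitness(chromosome, scores, labels, lam=3e-4):
        """chromosome: binary vector of length N selecting ConvNets.
        scores: array (N, M, K) of class scores; labels: (M,) true classes.
        Returns accuracy of the summed scores minus lam * |I|, as in Eq. (5.5)."""
        selected = np.flatnonzero(chromosome)
        if selected.size == 0:
            return -np.inf                       # empty subsets are invalid
        combined = scores[selected].sum(axis=0)  # (M, K)
        accuracy = np.mean(np.argmax(combined, axis=1) == labels)
        return accuracy - lam * selected.size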

5.7 Evaluating Networks
We previously explained a few methods for evaluating classification models, including ConvNets. In this section, we provide different techniques that can be used for analyzing
ConvNets. To this end, we trained the network shown in Fig. 5.11 and its compact
version 10 times and evaluated them using the test set provided in the GTSRB dataset.
Table 5.1 shows the results. The average classification accuracy over the 10 trials is
98.94 and 98.99% for the original ConvNet and its compact version, respectively,
which are both above the average human performance reported in Stallkamp et al.
(2012). In addition, the standard deviations of the classification accuracy are small,
which shows that the proposed architecture trains networks with very close accuracies. We argue that this stability is the result of the reduction in the number of the


Table 5.1 Classification accuracy of the single network. Top: the proposed network in Aghdam et al. (2016a); bottom: its compact version

Aghdam et al. (2016a) (original)
Trial                     Top 1 acc. (%)    Top 2 acc. (%)
1                         98.87             99.62
2                         98.98             99.64
3                         98.85             99.62
4                         98.98             99.58
5                         98.99             99.63
6                         99.06             99.75
7                         98.99             99.66
8                         99.05             99.70
9                         98.88             99.57
10                        98.77             99.60
Average                   98.94 ± 0.09      99.64 ± 0.05
Human                     98.84             NA
Ciresan et al. (2012a)    98.52 ± 0.15      NA
Jin et al. (2014)         98.96 ± 0.20      NA

Aghdam et al. (2016a) (compact)
Trial                     Top 1 acc. (%)    Top 2 acc. (%)
1                         99.11             99.63
2                         99.06             99.64
3                         98.88             99.62
4                         98.97             99.61
5                         99.08             99.66
6                         98.94             99.68
7                         98.87             99.60
8                         98.98             99.65
9                         98.92             99.61
10                        99.05             99.63
Average                   98.99 ± 0.08      99.63 ± 0.02
Human                     98.84             NA
Ciresan et al. (2012a)    98.52 ± 0.15      NA
Jin et al. (2014)         98.96 ± 0.20      NA

parameters and regularizing the network using a dropout layer. Moreover, we observe
that the top-2 accuracies8 are very close in all trials and their standard deviations

8 The percentage of the samples which are always within the top 2 classification scores.


are 0.05 and 0.02 for the original ConvNet and its compact version, respectively. In
other words, although the difference between the top-1 accuracies of Trial 1 and
Trial 2 of the original network is 0.11%, the same difference for the
top-2 accuracy is only 0.02%. This implies that there are images that are classified correctly in Trial 1 and misclassified in Trial 2 (or vice versa) but that are always
within the top-2 scores of both networks. As a consequence, if we fuse the scores
of the two networks, the classification accuracy might increase. The same argument
applies to the compact network as well. Compared with the average accuracies
of the single ConvNets proposed in Ciresan et al. (2012a) and Jin et al. (2014), the
architecture in Aghdam et al. (2016a) and its compact version are more stable since
their standard deviations are lower than the standard deviations of these two ConvNets.
In addition, despite the fact that the compact network has 52% fewer parameters than
the original network, the accuracy of the compact network is higher than that of the original
network and the two other networks. This confirms the claim illustrated by the t-SNE
visualization in Fig. 5.11 that the fully connected layer in the original ConvNet does
not increase the separability of the traffic signs. But the fact remains that the compact
network has fewer degrees of freedom compared with the original network. Taking into
account the bias-variance decomposition of the original ConvNet and the compact
ConvNet, Aghdam et al. (2016a) claim that the compact ConvNet is more biased
and its variance is lower compared with the original ConvNet. To prove this, they
created two different ensembles using the algorithm mentioned in Sect. 5.6.3. More
specifically, one ensemble was created by selecting the optimal subset from a pool
of 10 original ConvNets and the second ensemble was created in the same way but from
a pool of 10 compact ConvNets. Furthermore, two other ensembles were created
by utilizing the model averaging approach (Ciresan et al. 2012a; Jin et al. 2014), in
which each ensemble contains 10 ConvNets. Tables 5.2 and 5.3 show the results and
compare them with three other state-of-the-art ConvNets.
First, we observe that creating the ensemble based on the optimal subset selection method
is more efficient than the model averaging approach. To be more specific, the
ensemble created by selecting the optimal subset of ConvNets needs 50% fewer

Table 5.2 Comparing the classification performance of the ensembles created by model averaging and by the optimal subset selection method on the pools of original and compact ConvNets proposed in Aghdam et al. (2016a) with three state-of-the-art ConvNets

Name                               No. of ConvNets    Accuracy (%)    F1-score
Ens. of original ConvNets          5                  99.61           0.994
Ens. of original ConvNets (avg.)   10                 99.56           0.993
Ens. of compact ConvNets           2                  99.23           0.989
Ens. of compact ConvNets (avg.)    10                 99.16           0.987
Ciresan et al. (2012a)             25                 99.46           NA
Sermanet and Lecun (2011)          1                  98.97           NA
Jin et al. (2014)                  20                 99.65           NA


Table 5.3 Comparing the run-time efficiency of the ensembles created by model averaging and by the optimal subset selection method on pools of original and compact ConvNets with three state-of-the-art ConvNets. Note that we have calculated the worst case by considering that every Leaky ReLU unit performs one multiplication. In contrast, we have computed the minimum number of multiplications in Ciresan et al. (2012a) by assuming that the input of the tanh function always falls in the range [−0.5, 0.5]. Similarly, in the case of Jin et al. (2014), we have considered a fast but inaccurate implementation of pow(float, float)

Name                               No. of ConvNets    No. of parameters    No. of multiplications
Ens. of original ConvNets          5                  1,123,449            382,699,560
Ens. of original ConvNets (avg.)   10                 1,123,449            765,399,120
Ens. of compact ConvNets           2                  531,999              151,896,924
Ens. of compact ConvNets (avg.)    10                 531,999              759,484,620
Ciresan et al. (2012a)             25                 1,543,443            3,208,042,500
Sermanet and Lecun (2011)          1                  1,437,791            NA
Jin et al. (2014)                  20                 1,162,284            1,445,265,400

multiplications9 and its accuracy is 0.05% higher compared with the ensemble created by averaging all 10 original ConvNets in the pool. Note that the
number of ConvNets in the ensemble directly affects the number of arithmetic
operations required for making predictions. This means that the model averaging
approach consumes double the CPU cycles compared with the optimal subset ensemble.
Moreover, the ensemble created by the optimal subset criterion reduces the number of
multiplications by 88 and 73% compared with the ensembles proposed in Ciresan
et al. (2012a) and Jin et al. (2014), respectively. More importantly, this dramatic
reduction in the number of multiplications causes only five more misclassifications
(0.04% lower accuracy) compared with the results obtained by the ensemble in Jin et al.
(2014). We also observe that the ensemble in Ciresan et al. (2012a) makes 19 more
mistakes (0.15% more misclassifications) compared with the optimal subset ensemble.
Besides, the number of multiplications of the network proposed in Sermanet
and Lecun (2011) is not accurately computable since its architecture is not clearly
described. However, the number of parameters of this ConvNet is larger than that of the

9 We calculated the number of multiplications of a ConvNet taking into account the number of multiplications for convolving the filters of each layer with the N-channel input from the previous layer, the number of multiplications required for computing the activations of each layer, and the number of multiplications imposed by normalization layers. We previously explained that the tanh function utilized in Ciresan et al. (2012a) can be efficiently computed using 10 multiplications. The ReLU activation used in Jin et al. (2014) does not need any multiplications and the Leaky ReLU units in Aghdam et al. (2016a) compute the result using only 1 multiplication. Finally, considering that the pow(float, float) function needs only 1 multiplication and 64 shift operations (tinyurl.com/yehg932), the normalization layer in Jin et al. (2014) requires k × k + 3 multiplications per element of the feature map.


ConvNet in Aghdam et al. (2016a). In addition, it utilizes a rectified sigmoid activation,
which needs 10 multiplications per element of the feature maps. In sum, we
can roughly conclude that the ConvNet in Sermanet and Lecun (2011) needs more
multiplications. However, we observe that an ensemble of two compact ConvNets
performs better than Sermanet and Lecun (2011) and, yet, it needs fewer multiplications
and parameters.
Finally, although the single compact ConvNet performs better than the single original
ConvNet, the ensemble of compact ConvNets does not perform better.
In fact, according to Table 5.2, an ensemble of two compact ConvNets shows a better
performance than the ensemble of 10 compact ConvNets. This is due to
the fact that the compact ConvNet has much fewer parameters and
is more biased compared with the original ConvNet. Consequently, its representation ability is more restricted. For this reason, adding more ConvNets to the
ensemble does not increase the performance, which always varies around 99.20%. In
contrast, the original network is able to model more complex nonlinearities, so it is
less biased about the data and its variance is higher than that of the compact network. Hence,
the ensemble of the original networks possesses a more discriminative representation
which increases its classification performance. In sum, if run-time efficiency is more
important than accuracy, then an ensemble of two compact ConvNets is a good
choice. However, if we need more accuracy and the computational burden imposed
by the additional multiplications of the original network is negligible, then the ensemble of
the original ConvNets can be utilized.
It is worth mentioning that the time-to-completion (TTC) of a ConvNet does not
solely depend on the number of multiplications. The number of memory accesses also
affects the TTC of a ConvNet. Of the ConvNets listed in Table 5.3, a single
ConvNet proposed in Jin et al. (2014) seems to have a better TTC since it needs
fewer multiplications compared with Aghdam et al. (2016a) and its compact version.
However, Jin et al. (2014, Table IX) shows that this ConvNet needs to pad the feature
maps before each convolution layer and that there are three local response normalization
layers in this ConvNet. For this reason, it needs more memory accesses, which can
negatively affect its TTC. To compute the TTC of these ConvNets in
practice, we ran the ConvNets on both a CPU (Intel Core i7-4960) and a GPU (GeForce
GTX 980). The hard disk was not involved in any other task and there were no
running applications or GPU-demanding processes. The status of the hardware was
fixed during the calculation of the TTC of the ConvNets. Then, the average TTC of the
forward pass of every ConvNet was calculated by running each ConvNet 200 times.
Table 5.4 shows the results in milliseconds for one forward pass.
The results show that the TTC of the single ConvNet proposed in Jin et al. (2014)
is 12 and 37% higher than that of Aghdam et al. (2016a) when it runs on the CPU and the GPU,
respectively. This is consistent with our earlier discussion that the TTC of a ConvNet
does not solely depend on arithmetic operations; the number of memory accesses
also affects the TTC. Also, the ensemble of Aghdam et al. (2016a) is 78 and
81% faster than the ensemble proposed in Jin et al. (2014).
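A minimal sketch of this timing protocol in pycaffe is given below; the deploy/weights file names are placeholders, and reproducing the GPU numbers would additionally require caffe.set_mode_gpu() and device selection.

    import time
    import caffe

    net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)
    caffe.set_mode_cpu()

    start = time.time()
    for _ in range(200):
        net.forward()                      # one forward pass
    print('mean forward time: %.2f ms' % ((time.time() - start) / 200 * 1000))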


Table 5.4 Benchmarking the time-to-completion of Aghdam et al. (2016a), its compact ConvNet and Jin et al. (2014), obtained by running the forward pass of each ConvNet 200 times and computing the average time for completing the forward pass

       Aghdam et al. (2016a)    Aghdam et al. (2016a) (compact)    Jin et al. (2014)
CPU    12.96 ms                 12.47 ms                           14.47 ms
GPU    1.06 ms                  1.03 ms                            1.45 ms

       Aghdam et al. (2016a) ens.    Aghdam et al. (2016a) ens. (compact)    Jin et al. (2014) ens.
CPU    5 × 12.96 = 64.8 ms           2 × 12.47 = 24.94 ms                    20 × 14.47 = 289.4 ms
GPU    5 × 1.06 = 5.30 ms            2 × 1.03 = 2.06 ms                      20 × 1.45 = 29.0 ms

Table 5.5 Class-specific precision and recall obtained by the network in Aghdam et al. (2016a). Bottom images show the corresponding traffic sign of each class label. The column support (sup) shows the number of test images for each class

Class  precision  recall  sup    Class  precision  recall  sup    Class  precision  recall  sup
0      1.00       1.00    60     15     1.00       1.00    210    30     1.00       0.97    150
1      1.00       1.00    720    16     1.00       1.00    150    31     1.00       0.99    270
2      1.00       1.00    750    17     1.00       1.00    360    32     1.00       1.00    60
3      1.00       0.99    450    18     1.00       0.99    390    33     1.00       1.00    210
4      1.00       0.99    660    19     0.97       1.00    60     34     1.00       1.00    120
5      0.99       1.00    630    20     0.99       1.00    90     35     1.00       1.00    390
6      1.00       0.98    150    21     0.97       1.00    90     36     0.98       1.00    120
7      1.00       1.00    450    22     1.00       1.00    120    37     0.97       1.00    60
8      1.00       1.00    450    23     1.00       1.00    150    38     1.00       1.00    690
9      1.00       1.00    480    24     0.99       0.99    90     39     0.98       0.98    90
10     1.00       1.00    660    25     1.00       0.99    480    40     0.97       0.97    90
11     0.99       1.00    420    26     0.98       1.00    180    41     1.00       1.00    60
12     1.00       1.00    690    27     0.97       1.00    60     42     0.98       1.00    90
13     1.00       1.00    720    28     1.00       1.00    150
14     1.00       1.00    270    29     1.00       1.00    90

5.7.1 Misclassified Images
We computed the class-specific precision and recall (Table 5.5). Besides, Fig. 5.14
illustrates the incorrectly classified traffic signs. The blue and red numbers below
each image show the actual and predicted class labels, respectively. For presentation
purposes, all images were scaled to a fixed size. First, we observe that there are
four cases where the images are incorrectly classified as class 11 while the true
label is 30. In particular, three of these cases are low-resolution images with poor
illumination. Moreover, class 30 is distinguishable from class 11 only through fine
differences in the pictograph. However, rescaling a poorly illuminated low-resolution

Fig. 5.14 Incorrectly classified images. The blue and red numbers below each image show the
actual and predicted class labels, respectively. The traffic sign corresponding to each class label is
illustrated in Table 5.5

image to 48 × 48 pixels causes some artifacts on the image. In addition, two of these
images are inappropriately localized and their bounding boxes are inaccurate. As
a result, the network is not able to discriminate these two classes on these images. In
addition, by inspecting the rest of the misclassified images, we realize that the wrong
classifications are mainly due to occlusion of the pictograph or to the low quality of the images.
However, there are a few cases where the main reason for the misclassification is
inaccurate localization of the traffic sign in the detection stage (i.e., an inaccurate
bounding box).

5.7.2 Cross-Dataset Analysis and Transfer Learning
So far, we have trained a ConvNet on the GTSRB dataset and achieved state-of-the-art results
with much fewer arithmetic operations and memory accesses, which led to a considerably faster approach for classification of traffic signs. In this section, we inspect
how transferable this ConvNet is across different datasets. To this end, we first evaluate the cross-dataset performance of the network. To be more specific, we use the
trained ConvNet to predict the class of the traffic signs in the Belgium traffic sign
classification (BTSC) dataset (Radu Timofte 2011) (Fig. 5.15).
We inspected the dataset to make it consistent with the GTSRB. For instance, Class
32 in this dataset contains both the "speed limit 50" and the "speed limit 70" signs. However,
these are two distinct classes in the GTSRB dataset. Therefore, we separated the
overlapping classes in the BTSC dataset according to the GTSRB dataset. Each
image in the BTSC dataset contains one traffic sign, and the dataset consists of 4,672
color images for training and 2,550 color images for testing. Finally, we normalize
the dataset using the mean image obtained from the GTSRB dataset and resize all
the images to 48 × 48 pixels.
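A minimal sketch of this preprocessing step is shown below, assuming OpenCV for resizing and a NumPy file holding the GTSRB mean image; the actual pipeline in the book uses Caffe's binaryproto mean file, so the file name here is a placeholder.

    import cv2
    import numpy as np

    gtsrb_mean = np.load('gtsrb_mean_48x48.npy')        # HxWx3, float32

    def preprocess(image_bgr):
        resized = cv2.resize(image_bgr, (48, 48)).astype(np.float32)
        return resized - gtsrb_mean                     # mean subtraction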
Among 73 classes in the BTSC dataset (after separating the overlapping classes),
there are 23 common classes with the GTSRB dataset. We applied our ConvNet
trained on the GTSRB dataset to classify these 23 classes inside both the training set


Fig. 5.15 Sample images from the BTSC dataset

and the testing set of the BTSC dataset. Table 5.6 shows the class-specific precision
and recall.
In terms of accuracy, the trained network correctly classifies 92.12% of the samples. However, the precisions and recalls reveal that the classification of class 29 is
worse than a random guess. To find out the reason, we inspect the misclassified
images illustrated in Fig. 5.16.
Comparing class 29 in the BTSC dataset with its corresponding class in the
GTSRB (Table 5.5) shows that the pictograph of this class in the GTSRB dataset has
significant differences from the pictograph of the same class in the BTSC dataset.
In general, the misclassifications are mainly due to pictograph differences, perspective variation, rotation and blurriness of the images. We inspected the GTSRB
dataset and found that perspective and rotation are more controlled than in the BTSC
dataset. As a result, the trained ConvNet has not properly captured the variations
caused by different perspectives and rotations of the traffic signs. In other words, if
we present an adequate amount of data covering different combinations of perspective
and rotation, the ConvNet might be able to accurately model the traffic signs in the
BTSC dataset.
To prove that, we try to find out how transferable the ConvNet is. We follow
the same procedure mentioned in Yosinski et al. (2014) and evaluate the degree of
transferability of the ConvNet in different stages. Concretely, the original ConvNet
is trained on the GTSRB dataset. The Softmax loss layer of this network consists
of 43 neurons since there are only 43 classes in the GTSRB dataset. We can think
of the layers from the transformation layer up to the LReLU4 layer as a function which extracts
the features of the input image. Thus, if this feature extraction algorithm performs
accurately on the GTSRB dataset, it should also be able to model the traffic signs


Table 5.6 Cross-dataset evaluation of the trained ConvNet on the BTSC dataset. Class-specific precision and recall obtained by the network are shown. The column support (sup) shows the number of test images for each class. Classes with support equal to zero do not have any test cases in the BTSC dataset

Class  Precision  Recall  Sup    Class  Precision  Recall  Sup    Class  Precision  Recall  Sup
0      NA         NA      0      15     0.91       0.86    167    30     NA         NA      0
1      NA         NA      0      16     1.00       0.78    45     31     NA         NA      0
2      NA         NA      0      17     1.00       0.93    404    32     NA         NA      0
3      NA         NA      0      18     0.99       0.93    125    33     NA         NA      0
4      1.00       0.93    481    19     1.00       0.90    21     34     NA         NA      0
5      NA         NA      0      20     0.93       0.96    27     35     0.92       1.00    96
6      NA         NA      0      21     0.92       0.92    13     36     1.00       0.83    18
7      NA         NA      0      22     0.72       1.00    21     37     NA         NA      0
8      NA         NA      0      23     1.00       0.95    19     38     NA         NA      0
9      0.94       0.94    141    24     0.66       1.00    21     39     NA         NA      0
10     NA         NA      0      25     0.90       1.00    47     40     0.99       0.87    125
11     0.88       0.91    67     26     0.75       0.86    7      41     NA         NA      0
12     0.97       0.95    382    27     NA         NA      0      42     NA         NA      0
13     0.97       0.99    380    28     0.89       0.91    241
14     0.87       0.95    86     29     0.19       0.08    39

Overall accuracy: 92.12%

Fig. 5.16 Incorrectly classified images from the BTSC dataset. The blue and red numbers below
each image show the actual and predicted class labels, respectively. The traffic sign corresponding
to each class label is illustrated in Table 5.5

in the BTSC dataset. To evaluate the generalization power of the ConvNet trained
only on the GTSRB dataset, we replace the Softmax layer with a new Softmax layer
including 73 neurons to classify the traffic signs in the BTSC dataset. Then, we
freeze the weights of all the layers except the Softmax layer and run the training


algorithm on the BTSC dataset to learn the weights of the Softmax layer. Finally, we
evaluate the performance of the network using the testing set of the BTSC dataset.
This empirically shows how transferable the network in Aghdam et al. (2016a) is
to other traffic sign datasets.
It is well studied that the first layer of a ConvNet is more general and the last
layer is more class-specific. This means that the FC1 layer in Fig. 5.11 is more
specific than the C3 layer. In other words, the FC1 layer is adjusted to classify the
43 traffic signs in the GTSRB dataset. As a result, it might not be able to capture
every aspect of the BTSC dataset. If this assumption is true, then we can adjust
the weights of the FC1 layer besides the Softmax layer so that it can model the BTSC
dataset more accurately. Then, by evaluating the performance of the ConvNet on the
testing set of the BTSC dataset, we can find out to what extent the C3 layer is able
to adjust to the BTSC dataset. We progressively allow more layers to be adjusted on
the BTSC dataset and evaluate the classification performance. At the end, we have
five different networks with the same configuration but different weight adjustment
procedures on the BTSC dataset. Table 5.7 shows the weights which are fixed and
adjusted in each network. We repeated the training 4 times for each row in this table.
Figure 5.17 shows the results.
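In Caffe, freezing a layer amounts to setting its learning-rate multipliers to zero. The sketch below shows how this could be expressed in the Python net-specification style used in this chapter; the helper function and names are illustrative, not the book's own helpers.

    import caffe
    L = caffe.layers

    # lr_mult = 0 keeps both the weights and the biases of the layer fixed
    frozen = [{'lr_mult': 0, 'decay_mult': 0},   # weights
              {'lr_mult': 0, 'decay_mult': 0}]   # biases

    def frozen_conv(bottom, num_output, kernel_size):
        return L.Convolution(bottom, num_output=num_output,
                             kernel_size=kernel_size, param=frozen)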
First, we observe that when we only adjust the Softmax layer (layer 5) and freeze
the previous layers, the accuracy drops dramatically compared with the results on the
GTSRB dataset. On the one hand, layer 4 is adjusted such that the traffic signs in
the GTSRB dataset become linearly separable and can be discriminated using the
linear classifier in the Softmax layer. On the other hand, the number of traffic sign classes
in the BTSC dataset is 70% higher than in the GTSRB dataset. Therefore,
layer 4 is not able to linearly differentiate the fine details of the traffic signs in the
BTSC dataset. This is observable in the t-SNE visualization of the LReLU4 layer
corresponding to n = 5 in Fig. 5.17. Consequently, the classification performance
drops because of overlaps between the classes.
If the above argument is true, then fine-tuning layer 4 besides layer 5
must increase the performance, because in this way we let the LReLU4 layer
be adjusted to the traffic signs included in the BTSC dataset. We see in the figure
that adjusting layer 4 (n = 4) in addition to layer 5 (n = 5) increases the classification

Table 5.7 Layers which are frozen and adjusted in each trial to evaluate the generality of each layer

ConvNet No.    Trans.    Conv1 (layer 1)    Conv2 (layer 2)    Conv3 (layer 3)    FC1 (layer 4)    Softmax (layer 5)
1              Fixed     Fixed              Fixed              Fixed              Fixed            Adjust
2              Fixed     Fixed              Fixed              Fixed              Adjust           Adjust
3              Fixed     Fixed              Fixed              Adjust             Adjust           Adjust
4              Fixed     Fixed              Adjust             Adjust             Adjust           Adjust
5              Fixed     Adjust             Adjust             Adjust             Adjust           Adjust


Fig. 5.17 The result of fine-tuning, on the BTSC dataset, the ConvNet that was trained on the GTSRB
dataset. The horizontal axis shows the layer n at which the network starts the weight adjustment. In
other words, the weights of the layers before layer n are fixed (frozen). The weights of layer n and
all layers after layer n are adjusted on the BTSC dataset. We repeated the fine-tuning procedure
4 times for each n ∈ {1, . . . , 5}, separately. Red circles show the accuracy of each trial and blue
squares illustrate the mean accuracy. The t-SNE visualizations of the best network for n = 3, 4, 5
are also illustrated. The t-SNE visualization is computed on the LReLU4 layer

accuracy from 97.65 to 98.44%. Moreover, the t-SNE visualization corresponding
to n = 4 reveals that the traffic sign classes are more separable compared with
the result for n = 5. Thus, adjusting both the LReLU4 and Softmax layers makes the
network more accurate, for the reason we mentioned above.
Recall from Fig. 5.11 that LReLU4 was not mainly responsible for increasing
the separability of the classes. Instead, we saw that this layer mainly increases the
variance of the ConvNet and improves the performance of the ensemble. In fact,
we showed that the traffic signs are chiefly separated by the last convolution layer.
To further inspect this hypothesis, we fine-tuned the ConvNet on the BTSC dataset
starting from layer 3 (i.e., the last convolution layer). Figure 5.17 illustrates an increase
up to 98.80% in the classification accuracy. This can also be seen in the t-SNE
visualization corresponding to layer 3, where the traffic signs of the BTSC dataset
become more separable when the ConvNet is fine-tuned starting from layer 3.
Interestingly, we observe a performance reduction when the weight adjustment
starts from layer 2 or layer 1. Specifically, the mean classification accuracy drops

from 98.80% for layer 3 to 98.67 and 98.68% for layer 2 and layer 1, respectively.
This is due to the fact that the first two layers are more general and they do not
significantly change from the GTSRB to the BTSC dataset. In fact, these two layers
are trained to detect blobs and oriented edges. However, because the amount of data
in the BTSC dataset is very small compared with the number of parameters in the
ConvNet, fine-tuning adversely modifies the general filters in the first two layers, which
consequently affects the weights of the subsequent layers. As a result, the ConvNet
overfits the data and does not generalize well on the test set. For this reason, the
accuracy of the network drops when we fine-tune the network starting from layer 1
or layer 2.
Finally, it should be noted that the 98.80% accuracy is obtained using only a single network. As we showed earlier, creating an ensemble of these networks could improve
the classification performance. In sum, the results obtained from the cross-dataset analysis and the transferability evaluation reveal that the network is able to model a wide
range of traffic signs, and in the case of new datasets it only needs to be fine-tuned
starting from the last convolution layer.

5.7.3 Stability of ConvNet
A ConvNet is a nonlinear function that transforms a D-dimensional vector into a
K-dimensional vector in the layer before the classification layer. Ideally, small
changes in the input should produce small changes in the output. In other words,
if the image f ∈ R^{M×N} is correctly classified as c by the ConvNet, then the image
g = f + r obtained by adding a small degradation r ∈ R^{M×N} to f must also be
classified as c.
However, f becomes strongly degraded as ‖r‖ (the norm of r) increases. Therefore, at a
certain point, the degraded image g is no longer recognizable. We are interested
in finding the r with minimum ‖r‖ that causes g and f to be classified differently.
Szegedy et al. (2014b) investigated this problem and proposed to minimize the
following objective function with respect to r:
minimize_r  λ|r| + score(f + r, l)
s.t.  f + r ∈ [0, 1]^{M×N}    (5.6)

where l is the actual class label, λ is the regularizing weight, and score(f + r, l)
returns the score of the degraded image f + r for the actual class of image f.
In our ConvNet, the classification score vector is 43-dimensional since there are
only 43 classes in the GTSRB dataset. Denoting the classification score vector by
L ∈ [0, 1]^43, L[k] returns the score of the input image for class k. The image is
classified correctly if c = arg max L = l, where c is the index of the maximum value
in the score vector L. If max(L) = 0.9, the ConvNet is 90% confident that the input
image belongs to class c. However, there might be an image where max(L) = 0.3.
This means that the image belongs to class c with probability 0.3. If we manually
inspect the scores of the other classes we might realize that L[c_2] = 0.2, L[c_3] =


0.2, L[c_4] = 0.2, and L[c_5] = 0.1, where c_i denotes the class with the ith highest score in the
vector L.
Conversely, assume two images that are misclassified by the ConvNet. For the first
image, L[l] = 0.1 and L[c] = 0.9, meaning that the ConvNet believes the input
image belongs to class l and class c with probabilities 0.1 and 0.9, respectively. For
the second image, the beliefs of the ConvNet are L[l] = 0.49 and L[c] = 0.51.
Even though in both cases the images are misclassified, the degrees of misclassification
are different.
One problem with the objective function (5.6) is that it finds r such that score(f +
r, l) approaches zero. In other words, it finds r such that L[l] = ε and L[c] =
1 − ε. Assume the current state of the optimization is r_t, where L[l] = 0.3
and L[c] = 0.7. In other words, the input image f is already misclassified under the current
degradation r_t. Yet, the goal of the objective function (5.6) is to settle in a point where
score(f + r_t, l) = ε. As a result, it might change r_t in a way that increases ‖r_t‖.
Consequently, the degradation found by minimizing the objective function (5.6)
might not be optimal. To address this problem, we propose the following objective
function to find the degradation r:
minimize_r  ψ(L, l) + λ‖r‖_1    (5.7)

ψ(L, l) = β × L[l]            if arg max_c L = l
        = max(L) − L[l]       otherwise    (5.8)

In this equation, λ is the regularizing weight, β is a multiplier that penalizes those values
of r that do not degrade the image enough to be misclassified by the ConvNet,
and ‖·‖_1 is the sparsity-inducing term that forces r to be sparse. The above objective
function finds a value of r such that degrading the input image f with r causes the
image to be classified incorrectly while the difference between the highest score in L
and the score of the true label of f is minimal. This guarantees that f + r will be outside the
decision boundary of the actual class l but as close as possible to this decision
boundary.
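A minimal sketch of how (5.7) and (5.8) could be evaluated is shown below; here scores is assumed to be the ConvNet's score vector for the degraded image f + r (computed elsewhere), and the values of beta and lam are illustrative.

    import numpy as np

    def psi(scores, true_label, beta=10.0):
        """Eq. (5.8): penalize degradations that leave the image correctly
        classified, otherwise measure how far the top score is above the
        true-class score."""
        if np.argmax(scores) == true_label:
            return beta * scores[true_label]
        return scores.max() - scores[true_label]

    def objective(r, scores, true_label, lam=1e-3):
        """Eq. (5.7): misclassification term plus L1 sparsity on the noise r."""
        return psi(scores, true_label) + lam * np.abs(r).sum()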
We minimize the objective function (5.7) using a genetic algorithm. To this end,
we use a real-valued encoding scheme for representing the population. The size of each
chromosome in the population is equal to the number of elements in r, and each
chromosome represents a solution for r. We use the tournament method with tournament size
5 for selecting the offspring. Then, a new offspring is generated using the arithmetic,
intermediate or uniform crossover operators. Next, the offspring is mutated by
adding a small number in the range [−10, 10] to some of its genes.
Finally, we use elitism to always keep the best solution in the population. We applied
this optimization procedure to one image from each traffic sign class. Figure 5.18
shows the results.
Inspecting all the images in this figure, we realize that the ConvNet can easily
make mistakes even for degradations which are not perceivable by the human eye. The same conclusion is made by Szegedy et al. (2014b). This suggests that the function represented
by the ConvNet is highly nonlinear, where small changes in the input may cause a


Fig. 5.18 Minimum additive noise which causes the traffic sign to be misclassified with the minimum
difference from the highest score

significant change in the output. When the output changes dramatically, it might fall
into a wrong class in the feature space. Hence, the image is incorrectly classified. Note
that, because of the proposed objective function, the difference between the wrongly
predicted class and the true class is positive but the sample stays very close to the decision boundary
between the two classes. We repeated the above procedure on 15 different images and calculated the mean Signal-to-Noise Ratio (SNR) of each class separately. Figure 5.19
shows the results. First, we observe that classes 4 and 30 have the lowest SNR values.
In other words, the images of these two classes are more tolerant against noise. In
addition, class 15 has the highest SNR values, which shows it is more prone to being misclassified by small changes. Finally, most of the classes are tolerant against noise
with approximately the same degree of tolerance since they have close mean SNR
values. One simple solution to increase the tolerance of the ConvNet is to augment

Fig. 5.19 Plot of the SNRs of the noisy images found by optimizing (5.7). The mean SNR and its
variance are illustrated


Fig. 5.20 Visualization of the transformation and the first convolution layers

the training set with noisy images of various SNR values so that the network can learn how to handle small
changes.
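The per-class SNR values reported above can be computed from the clean image and the additive noise found by the optimization. A common definition in decibels is sketched below; whether the book reports the plain ratio or decibels is not stated, so this is an assumption.

    import numpy as np

    def snr_db(image, noise):
        """Signal power of the clean image over the power of the additive
        degradation, in decibels."""
        return 10.0 * np.log10(np.sum(image.astype(np.float64) ** 2) /
                               np.sum(noise.astype(np.float64) ** 2))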

5.7.3.1 Effect of Linear Transformation
We manually inspected the database and realized that there are images with poor
illumination. In fact, the transformation layer enhances the illumination of the input
image by multiplying each channel by a different constant factor and adding
a different intercept to the result. Note that there is a unique transformation function
per channel. This is different from applying the same linear transformation
function to all channels, which would not have any effect on the results of the convolution filters in the next layer (unless the transformation causes the intensity of
the pixels to exceed their limits). In this ConvNet, applying a different transformation
function to each channel affects the output of the subsequent convolution layer. In
this way, the transformation layer learns the parameters of the linear transformation
such that the classification accuracy increases. Figure 5.20 illustrates the output of
the transformation and the first convolution layers. We observe that the input image
suffers from poor illumination. However, applying the linear transformation to the
image enhances the illumination of each channel differently and, consequently, the
subsequent layers represent the image properly so it is classified correctly.

5.7.4 Analyzing by Visualization
Visualizing ConvNets helps to understand them under different circumstances. In
this section, we propose a new method for assessing the stability of the network
and, then, conduct various visualization techniques to analyze different aspects of
the proposed ConvNet.

5.8 Analyzing by Visualizing
Understanding the underlying process of ConvNets is not trivial. To be more specific, it is not easy to mathematically analyze a particular layer/neuron and determine what the layer/neuron exactly does on the input and what is extracted by the


layer/neuron. Visualization is a technique that helps to better understand the underlying process of a ConvNet. There are several ways of visualizing a ConvNet. In this section, we explain a few techniques that can be utilized in practical applications.

5.8.1 Visualizing Sensitivity
Assume we are given a clean image which is classified correctly by the ConvNet. We might be interested in localizing those areas of the image where degrading one of these areas by noise causes the image to be misclassified. This helps us to identify the sensitive regions of each traffic sign. To this end, we start from a window whose size is equal to 20% of the image size and slide this window over the image. At each location, the region under the window is degraded by noise and the classification score of the image is computed. By this way, we obtain a score matrix H^c where element H^c(m, n) is the score of the image belonging to class c when a small region of the image starting from (m, n) is degraded by noise (i.e., (m, n) is the top-left corner of
the window, not its center). We computed the matrix H_i^c, i ∈ {1, . . . , 20}, for 20 different instances of the same class and calculated the average matrix H̄^c = (1/20) ∑_{i=1}^{20} H_i^c as well as the average image. Figure 5.21 illustrates the heat map of H̄^c. First, we observe
that the ConvNet is mainly sensitive to a small portion of the pictograph in each traffic sign. For example, in the speed limit signs related to speeds less than 100, it is clear that the ConvNet is mainly sensitive to some part of the first digit. Conversely, the score is affected by all three digits in the “speed limit 100” sign. In addition, the score of the “speed limit 120” sign mainly depends on the second digit. These are all reasonable choices made by the ConvNet since the best way to classify two-digit speed limit signs is to compare their first digit. In addition, the “speed limit 100” sign is distinguishable from the “speed limit 120” sign only through the middle digit.
Furthermore, there are traffic signs such as the “give way” and the “no entry” signs for which the ConvNet is sensitive at almost every location on the image. In other

Fig. 5.21 Classification score of traffic signs averaged over 20 instances per each traffic sign. The
warmer color indicates a higher score and the colder color shows a lower score. The corresponding
window of element (m, n) in the score matrix is shown for one instance. It should be noted that the
(m, n) is the top-left corner of the window not its center and the size of the window is 20% of the
image size in all the results


Fig. 5.22 Classification score of traffic signs averaged over 20 instances per each traffic sign. The
warmer color indicates a higher score. The corresponding window of element (m, n) in the score
matrix is shown for one instance. It should be noted that the (m, n) is the top-left corner of the
window not its center and the size of the window is 40% of the image size in all the results

words, the score of the ConvNet is affected regardless of the position of the degraded region when the size of the degradation window is 20% of the image. We increased the size of the window to 40% and repeated the above procedure. Figure 5.22 shows the result. We still see that all the analysis mentioned for window size 20% holds true for window size 40% as well. In particular, we observe that the most sensitive regions of the “mandatory turn left” and the “mandatory turn right” traffic signs emerge by increasing the window size. Notwithstanding, degradation affects the classification score regardless of its location in these two signs.
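A minimal sketch of this degradation analysis is given below; `net.predict` is a hypothetical helper that returns the vector of class scores for one image, and the noise level is an assumed value rather than the one used in the book.

import numpy as np

def sensitivity_map(net, image, c, window_ratio=0.2, sigma=25.0):
    """Score matrix H where H[m, n] is the score of class c when the window
    whose top-left corner is at (m, n) is degraded by Gaussian noise."""
    h, w = image.shape[:2]
    win = int(round(window_ratio * min(h, w)))
    H = np.zeros((h - win + 1, w - win + 1), dtype=np.float32)
    for m in range(h - win + 1):
        for n in range(w - win + 1):
            degraded = image.astype(np.float32).copy()
            noise = np.random.normal(0.0, sigma, (win, win) + image.shape[2:])
            degraded[m:m + win, n:n + win] += noise
            H[m, n] = net.predict(np.clip(degraded, 0, 255))[c]
    return H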

5.8.2 Visualizing the Minimum Perception
Classifying traffic signs at night is difficult because perception of the traffic signs is very limited. In particular, the situation is much worse in interurban areas where the only lighting source is the headlights of the car. Unless the car is very close to the signs, it is highly probable that the traffic signs are only partially perceived by the camera. In other words, most of the perceived image might be dark. Hence, the question arises: “what is the minimum area that must be perceived by the camera to successfully classify the traffic signs?”


Fig. 5.23 Classification score of traffic signs averaged over 20 instances per each traffic sign. The
warmer color indicates a higher score. The corresponding window of element (m, n) in the score
matrix is shown for one instance. It should be noted that the (m, n) is the top-left corner of the
window not its center and the size of the window is 40% of the image size in all the results

To answer this question, we start from a window whose size is equal to 40% of the image size and slide this window over the image. At each location, we keep the pixels under the window untouched and zero out the rest of the pixels. Then, the image is fed into the ConvNet and the classification score is computed. By this way, we obtain a score matrix H where element H(m, n) is the score of the traffic sign when only a small region of the image starting from (m, n) is perceived by the camera. As before, we computed the average score matrix H̄ using 20 instances of each traffic sign. Figure 5.23 illustrates the heat map plot of H̄ obtained by sliding a window whose size is 40% of the image size. Based on this figure, we realize that in most of the traffic signs, the pictograph is the region with the highest response. In particular, some parts of the pictograph have the greatest importance for successfully identifying the traffic signs. However, there are signs such as the “priority road” sign which are not recognizable using 40% of the image. It seems that, instead of the pictograph, the ConvNet learns to detect color blobs as well as the shape information of the sign in order to recognize these traffic signs. We also computed the results obtained by increasing the window size to 60%. Nonetheless, since the same analysis applies to these results, we do not show them in this section to avoid redundancy of figures; they are illustrated in the supplementary material.
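The complementary analysis can be sketched in the same way: keep only the pixels under the window and zero out the rest. Again, `net.predict` stands for a hypothetical scoring helper, not an API of the book's implementation.

import numpy as np

def minimum_perception_map(net, image, c, window_ratio=0.4):
    """H[m, n] is the score of class c when only the window starting at (m, n)
    is visible and every other pixel is set to zero."""
    h, w = image.shape[:2]
    win = int(round(window_ratio * min(h, w)))
    H = np.zeros((h - win + 1, w - win + 1), dtype=np.float32)
    for m in range(h - win + 1):
        for n in range(w - win + 1):
            masked = np.zeros_like(image, dtype=np.float32)
            masked[m:m + win, n:n + win] = image[m:m + win, n:n + win]
            H[m, n] = net.predict(masked)[c]
    return H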

5.8.3 Visualizing Activations
We can think of the value of the activation function as the amount of excitement of a neuron with respect to the input image. Since the output of a neuron is linearly combined by the neurons in the next layer, as its level of excitement increases, it also changes the output of the subsequent neurons in the next layer. So, it is a common practice to inspect which images significantly excite a particular neuron.
To this end, we feed all the images in the test set of the GTSRB dataset into the ConvNet and keep a record of the activation of neuron (k, m, n) in the last pooling layer, where m and n denote the coordinates of the neuron in channel k of the last pooling result. According to Fig. 5.11, there are 250 channels in the last pooling


Fig. 5.24 Receptive field of some neurons in the last pooling layer

Fig. 5.25 Average image computed over each of 250 channels using the 100 images with highest
value in position (0, 0) of the last pooling layer. The corresponding receptive field of this position
is shown using a cyan rectangle

layer and each channel is a 3 × 3 matrix. Then, the images are sorted in descending order according to their value in position (k, m, n) of the last pooling layer and the average of the first 100 images is computed. It should be noted that each location (m, n) in the pooling layer has a corresponding receptive field in the input image. To compute the receptive field of each position, we must back-project the results from the last pooling layer to the input layer. Figure 5.24 shows the receptive field of some neurons in the last pooling layer.
We computed the average image of each neuron in the last pooling layer as mentioned above. This is shown in Fig. 5.25, where each image im_i depicts the receptive field of the neuron (0, 0) from the i-th channel of the last pooling layer. According to these figures, most of the neurons in the last pooling layer are mainly activated by a specific traffic sign. There are also some neurons which are highly activated by more than one traffic sign. To be more specific, these neurons are mainly sensitive to 2–4 traffic signs. By feeding an image of a traffic sign to the ConvNet, some of these neurons are highly activated while other neurons are deactivated (they are usually


close to zero or negative). The pattern of highly activated neurons is different for each traffic sign, and this is the reason the classes become linearly separable in the last pooling layer.
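The averaging procedure above can be sketched as follows; `activation(img)` is a hypothetical function returning the last pooling tensor (250 × 3 × 3) of one image, not the actual interface used in the book.

import numpy as np

def average_of_top_activating(images, activation, k, m, n, top=100):
    """Average the `top` images whose activation at neuron (k, m, n) of the
    last pooling layer is the highest."""
    values = np.array([activation(img)[k, m, n] for img in images])
    best = np.argsort(values)[::-1][:top]          # indices of the strongest responses
    stacked = np.stack([images[i] for i in best]).astype(np.float32)
    return stacked.mean(axis=0)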

5.9 More Accurate ConvNet
Visualizing the ConvNet in Fig. 5.11 showed that classification of traffic signs is
mainly done using the shape and the pictograph of traffic signs. Therefore, it is
possible to discard color information and use only gray-scale pixels to learn a representation by the ConvNet.
In this section, we will train a more accurate and computationally cheaper ConvNet for classification of traffic signs. To this end, Habibi Aghdam et al. (2016) computed the layerwise time-to-completion of the network in Fig. 5.11 using the command-line tools of the Caffe library. More specifically, executing the caffe command with the parameter time analyzes the run-time of the given network and returns a layerwise summary. Table 5.8 shows the results in milliseconds. We observe that the two middle layers with 4 × 4 kernels consume most of the GPU time. Moreover, the fully connected layer does not significantly affect the overall time-to-completion. In addition, the first convolution layer can be optimized by reducing the size of the kernel and the number of input channels.
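As a rough sketch (the prototxt file name and the iteration count below are placeholders, not the actual files of the book, and the caffe binary is assumed to be on the PATH), the benchmark can be launched from Python as follows; the tool prints a per-layer forward/backward timing summary.

import subprocess

# Hypothetical model definition; replace with the deploy prototxt of the network in Fig. 5.11.
cmd = ["caffe", "time",
       "-model", "classification_net.prototxt",
       "-iterations", "50",
       "-gpu", "0"]
subprocess.run(cmd, check=True)  # the layer-wise timing summary is written to the console/log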
From the accuracy perspective, the aim is to reach an accuracy higher than that of the previously trained network. The basic idea behind ConvNets is to learn a representation which makes objects linearly separable in the last layer. Fully connected layers facilitate this by learning a nonlinear transformation that projects the representation into another space. We can increase the degrees of freedom of the ConvNet by adding more fully connected layers to it. This may help to learn a better linearly separable representation. Based on these ideas, Habibi Aghdam et al. (2016) proposed the ConvNet illustrated in Fig. 5.26.
First, the color image is replaced with a gray-scale image in this ConvNet. In addition, because a gray-scale image is a single-channel input, the linear transformation layer is also discarded. Second, we have utilized Parametric Rectified Linear Units (PReLU) to learn a separate α_i for each feature map in a layer, where α_i denotes the value of the leaking parameter in LReLU. Third, we have added another fully connected layer to the network to increase its flexibility. Fourth, the sizes of the first kernel and the middle kernels have been reduced to 5 × 5 and 3 × 3, respectively.

Table 5.8 Per layer time-to-completion (milliseconds) of the previous classification ConvNet

Layer      Data   c1×1   c7×7   pool1  c4×4   pool2  c4×4   pool3  fc     Class
Time (ms)  0.032  0.078  0.082  0.025  0.162  0.013  0.230  0.013  0.062  0.032


Fig. 5.26 The modified ConvNet architecture compared with Fig. 5.11
Table 5.9 Per layer time-to-completion (milliseconds) of the classification ConvNet in Habibi Aghdam et al. (2016)

Layer      Data   c5×5   pool1   c3×3   pool2   c3×3   pool3   fc1    fc2    Class
Time (ms)  0.036  0.076  0.0166  0.149  0.0180  0.159  0.0128  0.071  0.037  0.032

Last but not least, the size of the input image is reduced to 44 × 44 pixels in order to reduce the dimensionality of the feature vector in the last convolution layer. Table 5.9 shows the layerwise time-to-completion of the ConvNet illustrated in Fig. 5.26.
According to this table, the time-to-completion of the middle layers has been reduced. In particular, the time-to-completion of the last convolution layer has been reduced substantially. In addition, the ConvNet saves 0.078 ms by removing the c1 × 1 layer of the previous architecture. It is worth mentioning that the overhead caused by the second fully connected layer is slight. In sum, the overall time-to-completion of the above ConvNet is less than that of the ConvNet in Fig. 5.11. Finally, we investigated the effect of the batch size on the time-to-completion of the ConvNet. Figure 5.27 illustrates the relation between the batch size of the classification ConvNet and its time-to-completion.
According to the figure, while processing 1 image takes approximately 0.7 ms using the classification ConvNet, processing 50 images takes approximately 3.5 ms using the same ConvNet (due to the parallel architecture of the GPU). In other words, if the detection ConvNet generates 10 samples, we do not need to feed the samples one by one to the ConvNet; that would take 0.7 × 10 = 7 ms to complete. Instead, we can feed a batch of 10 samples to the network and process them in approximately 1 ms. By this way, we can save more GPU time.
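A minimal sketch of this batching strategy, assuming a hypothetical `net.forward` wrapper that accepts a whole batch of 44 × 44 patches and returns one row of class scores per patch:

import numpy as np

def classify_proposals(net, patches):
    """Classify all detected patches in one forward pass instead of one by one."""
    batch = np.stack(patches).astype(np.float32)   # N x 3 x 44 x 44 (or N x 1 x 44 x 44)
    scores = net.forward(batch)                    # one GPU call for the whole batch
    return np.argmax(scores, axis=1)               # predicted class per patch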


Fig. 5.27 Relation between
the batch size and
time-to-completion of the
ConvNet

5.9.1 Evaluation
The classification ConvNet is also trained using mini-batch stochastic gradient descent (batch size = 50) with exponential learning rate annealing. We fix the learning rate to 0.02, momentum to 0.9, L2 regularization to 10⁻⁵, the annealing parameter to 0.99996, the dropout ratio to 0.5, and the initial value of the leaking parameters to 0.01. The network is trained 12 times and the classification accuracies on the test set are calculated. It is worth mentioning that Stallkamp et al. (2012) have only reported the classification accuracy, so it is the only way to compare the following results with Ciresan et al. (2012b) and Sermanet and Lecun (2011).
Table 5.10 shows the result of training 10 ConvNets with the same architecture
and different initializations. The average classification accuracy of the 10 trials is
99.34% which is higher than the average human performance reported in Stallkamp
et al. (2012). In addition, the standard deviation of the classification accuracy is
small which shows that the proposed architecture trains the networks with very close

Table 5.10 Classification accuracy of the single network

Trial  Top-1 acc. (%)  Top-2 acc. (%)    Trial  Top-1 acc. (%)  Top-2 acc. (%)
1      99.21           99.77             6      99.54           99.78
2      99.38           99.73             7      99.25           99.70
3      99.55           99.83             8      99.21           99.73
4      99.16           99.72             9      99.53           99.82
5      99.35           99.75             10     99.24           99.64
Average top-1  99.34 ± 0.02              Average top-2  99.75 ± 0.002
Human top-1    98.84                     Human top-2    NA


Table 5.11 Comparing the results with ConvNets in Ciresan et al. (2012a, b), Stallkamp et al. (2012), and Sermanet and Lecun (2011)

ConvNet                                                                    Accuracy (%)
Single ConvNet (best) (Ciresan et al. 2012a, b)                            98.80
Single ConvNet (avg.) (Ciresan et al. 2012a, b)                            98.52
Multi-scale ConvNet (official) (Stallkamp et al. 2012)                     98.31
Multi-scale ConvNet (best) (Sermanet and Lecun 2011)                       98.97
Proposed ConvNet (best)                                                    99.55
Proposed ConvNet (avg.)                                                    99.34
Committee of 25 ConvNets (Ciresan et al. 2012a, b; Stallkamp et al. 2012)  99.46
Ensemble of 3 proposed ConvNets                                            99.70

accuracies. We argue that this stability is the result of the reduction in the number of parameters and of regularizing the network using a dropout layer. Moreover, we observe that the top-2 accuracy (the percentage of samples that are always within the top-2 classification scores) is very close in all trials and its standard deviation is 0.002. In other words, although the difference in top-1 accuracy between Trial 5 and Trial 6 is 0.19%, the same difference for the top-2 accuracy is 0.03%. This implies that some cases are always within the top-2 results. In other words, there are images that have been classified correctly in Trial 5 but misclassified in Trial 6 (or vice versa). As a consequence, if we fuse the scores of two networks, the classification accuracy might increase.
Based on this observation, an ensemble was created using the optimal subset selection method. The created ensemble consists of three ConvNets (ConvNets 5, 6, and 9 in Table 5.10). As shown in Table 5.11, the overall accuracy of the network increases to 99.70% by this way. Furthermore, the proposed method has established a new record compared with the winner network reported in the competition (Stallkamp et al. 2012). Besides, we observe that the results of the single network have outperformed the two other ConvNets. Depending on the application, one can use the single ConvNet instead of the ensemble since it already outperforms state-of-the-art methods as well as human performance with a much lower time-to-completion.
Misclassified images: We computed the class-specific precision and recall
(Table 5.12). Besides, Fig. 5.28 illustrates the incorrectly classified traffic signs. The
number below each image shows the predicted class label. For presentation purposes,
all images were scaled to a fixed size. First, we observe that there are 4 cases where
the images are incorrectly classified as class 5 while the true label is 3. We note that
all these cases are degraded. Moreover, class 3 is distinguishable from class 5 using
the fine differences in the first digit of the sign. However, because of degradation the
ConvNet is not able to recognize the first digit correctly. In addition, by inspecting
the rest of the misclassified images, we realize that the wrong classification is mainly



Table 5.12 Class specific precision and recall obtained by the network in Habibi Aghdam et al. (2016). Bottom images show corresponding class number of each traffic sign

class  precision  recall    class  precision  recall    class  precision  recall
0      1.00       1.00      15     1.00       1.00      30     1.00       0.98
1      1.00       1.00      16     1.00       1.00      31     1.00       1.00
2      1.00       1.00      17     0.99       1.00      32     1.00       1.00
3      1.00       0.99      18     0.99       1.00      33     1.00       1.00
4      1.00       1.00      19     0.98       1.00      34     1.00       1.00
5      0.99       1.00      20     1.00       1.00      35     0.99       1.00
6      1.00       1.00      21     1.00       1.00      36     0.99       1.00
7      1.00       1.00      22     1.00       1.00      37     0.97       1.00
8      1.00       1.00      23     1.00       1.00      38     1.00       0.99
9      1.00       1.00      24     1.00       0.96      39     1.00       0.97
10     1.00       1.00      25     1.00       1.00      40     0.97       0.97
11     0.99       1.00      26     0.97       1.00      41     1.00       1.00
12     1.00       0.99      27     0.98       1.00      42     1.00       1.00
13     1.00       1.00      28     1.00       1.00
14     1.00       1.00      29     1.00       1.00

Fig. 5.28 Misclassified traffic signs. The blue and the red numbers indicate the actual and predicted class labels, respectively

due to occlusion of the signs and blurry or degraded images. In addition, the class-specific precision and recall show that the ConvNet is very accurate in classifying the traffic signs of all the classes.

5.9.2 Stability Against Noise
In real applications, it is necessary to study stability of the ConvNet against image
degradations. To empirically study the stability of the classification ConvNet against
Gaussian noise, the following procedure is conducted. First, we pick the test


Table 5.13 Accuracy of the ConvNets obtained by degrading the correctly classified test images in the original datasets using a Gaussian noise with various values of σ

                     Accuracy (%) for different values of σ
σ          1      2      4      8      10     15     20     25     30     40
Single     99.4   99.4   99.3   98.9   98.3   96.3   93.2   89.7   86.0   78.7
Ensemble   99.5   99.5   99.4   99.2   98.8   97.1   94.4   91.4   88.0   81.4
Correctly classified samples
Single     99.95  99.94  99.9   99.3   98.8   96.9   93.8   90.3   86.6   79.3
Ensemble   99.94  99.93  99.9   99.5   99.1   97.5   95.0   92.0   88.7   82.1

images from the original datasets. Then, 100 noisy images are generated for each σ ∈ {1, 2, 4, 8, 10, 15, 20, 25, 30, 40}. In other words, 1000 noisy images are generated for each test image in the original dataset. Next, each noisy image is fed into the ConvNet and its class label is computed. Table 5.13 reports the accuracy of the single ConvNet and the ensemble of three ConvNets for each value of σ. It is divided into two sections. In the first section, the accuracies are calculated on all the images. In the second section, we have only considered the noisy images whose clean versions are correctly classified by the single model and the ensemble model. Our aim in the second section is to study how noise may affect a sample which is originally classified correctly by our models.
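A compact sketch of this evaluation loop is shown below; `net.classify(img)` is a hypothetical helper returning the predicted label of one image, and `images`/`labels` are assumed to hold the clean test set.

import numpy as np

def noise_accuracy(net, images, labels,
                   sigmas=(1, 2, 4, 8, 10, 15, 20, 25, 30, 40), repeats=100):
    """For each sigma, degrade every test image `repeats` times with Gaussian
    noise and report the percentage of noisy copies that are classified correctly."""
    accuracy = {}
    for sigma in sigmas:
        correct, total = 0, 0
        for img, label in zip(images, labels):
            for _ in range(repeats):
                noisy = img.astype(np.float32) + np.random.normal(0.0, sigma, img.shape)
                correct += int(net.classify(np.clip(noisy, 0, 255)) == label)
                total += 1
        accuracy[sigma] = 100.0 * correct / total
    return accuracy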
According to this table, there are cases in which adding a Gaussian noise with σ = 1 to the images causes the models to incorrectly classify the noisy image. Note that a Gaussian noise with σ = 1 is not easily perceivable by the human eye. However, it may alter the classification result. Furthermore, there are also a few clean images that have been correctly classified by both models but are misclassified after adding a Gaussian noise with σ = 1. Notwithstanding, we observe that both models generate admissible results when σ < 10.
This phenomenon is partially studied by Szegedy et al. (2014b) and Aghdam et al. (2016c). The above behavior is mainly due to two reasons. First, the interclass margins might be very small in some regions of the feature space, where a sample may fall into another class after a slight change in the feature space. Second, ConvNets are highly nonlinear functions where a small change in the input may cause a significant change in the output (feature vector), so that samples may fall into a region representing another class. To investigate the nonlinearity of the ConvNet, we computed
the Lipschitz constant of the ConvNet locally. Denoting the transformation from the input layer up to layer fc2 by C_fc2(x), where x ∈ R^{W×H} is a gray-scale image, we compute the Lipschitz constant for every noisy image x + N(0, σ) using the following equation:

d(x, x + N(0, σ)) ≤ K d(C_fc2(x), C_fc2(x + N(0, σ)))    (5.9)

where K is the Lipschitz constant and d(a, b) computes the Euclidean distance between a and b. For each traffic sign category in the GTSRB dataset, we pick


Fig. 5.29 Lipschitz constant (top) and the correlation between d(x, x + N (0, σ )) and
d(C f c2 (x), C f c2 (x + N (0, σ ))) (bottom) computed on 100 samples from every category in the
GTSRB dataset. The red circles are the noisy instances that are incorrectly classified. The size of
each circle is associated with the values of σ in the Gaussian noise

100 correctly classified samples and compute the Lipschitz constant between the clean images and their noisy versions. The top graph in Fig. 5.29 illustrates the Lipschitz constant for each sample separately. Besides, the bottom graph shows d(x, x + N(0, σ)) versus d(C_fc2(x), C_fc2(x + N(0, σ))). The black and blue lines are the linear regression and the second-order polynomial fitted on the points. The size of a circle in the figure is associated with the value of σ in the Gaussian noise; a sample with a bigger σ appears bigger in the plot. In addition, the red circles show the samples which are incorrectly classified after adding noise to them.
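The local Lipschitz estimate in (5.9) reduces to a ratio of two Euclidean distances. A minimal sketch, assuming a hypothetical `features(img)` function that returns the fc2 representation C_fc2 of one image:

import numpy as np

def local_lipschitz(features, x, sigma):
    """Estimate K in (5.9) for one clean image x and one Gaussian-noise draw."""
    noisy = x.astype(np.float32) + np.random.normal(0.0, sigma, x.shape)
    d_in = np.linalg.norm(x.astype(np.float32) - noisy)        # d(x, x + N(0, sigma))
    d_out = np.linalg.norm(features(x) - features(noisy))      # d(C_fc2(x), C_fc2(x + N(0, sigma)))
    return d_in / max(d_out, 1e-12)                            # smallest K with d_in <= K * d_out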


There are some important findings in this figure. First, C_fc2 is locally a contraction in some regions since there are instances whose Lipschitz constant is 0 ≤ K ≤ 1 regardless of the value of σ. Also, K ∈ [ε, 2.5), which means that the ConvNet is very nonlinear in some regions. Besides, we also see that there are some instances whose Lipschitz constants are small but which are nevertheless incorrectly classified. This could be due to the first reason mentioned above. Interestingly, we also observe that there are some cases where the image is degraded using a low-magnitude noise (very small dots in the plot) but its Lipschitz constant is very large, meaning that in that particular region the ConvNet is very nonlinear along a specific direction. Finally, we also found that misclassification can happen regardless of the value of the Lipschitz constant.

5.9.3 Visualization
As we mentioned earlier, an effective way to examine each layer is to nonlinearly map the feature vector of a specific layer into a two-dimensional space using the t-SNE method (Maaten and Hinton 2008). This visualization is important since it shows how discriminating the different layers are and how a layer changes the behavior of the previous layer. Although there are other techniques such as Locally Linear Embedding, Isomap, and Laplacian Eigenmaps, the t-SNE method usually provides better results given high-dimensional vectors. We applied this method on the fully connected layer before the classification layer as well as on the last pooling layer of the detection and the classification ConvNets individually. Figure 5.30 illustrates the results for the classification ConvNets.
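A minimal sketch of this embedding step with scikit-learn's implementation of t-SNE; the feature matrix and label vector below are random placeholders standing in for the activations extracted from the chosen layer.

import numpy as np
from sklearn.manifold import TSNE

# Placeholder data: in practice `features` holds the relu4 (or pool3) activations
# of the test images and `labels` their class ids.
features = np.random.rand(500, 300).astype(np.float32)
labels = np.random.randint(0, 43, size=500)

embedded = TSNE(n_components=2, perplexity=30.0, init='pca').fit_transform(features)
# embedded[:, 0] and embedded[:, 1] can now be scattered with one color per class.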

Fig. 5.30 Visualizing the relu4 (left) and the pooling3 (right) layers in the classification ConvNet
using the t-SNE method. Each class is shown using a different color


Fig. 5.31 Histogram of leaking parameters

Comparing the results of the classification ConvNet shows that although the traffic sign classes are fairly separated in the last pooling layer, the fully connected layers increase the separability of the classes and make them linearly separable. Moreover, we observe that the two classes of the detection ConvNet are not separable in the pooling layer. However, the fully connected layer makes these two classes effectively separable. These results also explain why the accuracy of the above ConvNets is high. This is due to the fact that both ConvNets are able to accurately disperse the classes using the fully connected layers before the classification layers.
Leaking parameters: We initialize the leaking parameters of all PReLU units in the classification ConvNet to 0.01. In practice, applying PReLU activations takes slightly more time compared with LReLU activations in the Caffe framework. It is important to study the distribution of the leaking parameters to see if we can replace them with LReLU parameters. To this end, we computed the histogram of the leaking parameters for each layer separately. Figure 5.31 shows the results.
According to the histograms, the mean of the leaking parameters of each layer is different, except for the first and the second layers. In addition, the variance of each layer is different. One could replace the PReLU activations with LReLU activations and set the leaking parameter of each layer to the mean leaking parameter shown in this figure. By this way, the time-to-completion of the ConvNet would be reduced. However, it is not clear whether this would have a negative impact on the accuracy. In future work, we will investigate this setting.

5.10 Summary
This chapter started by reviewing related work in the field of traffic sign classification. Then, it explained the necessity of splitting data and some of the methods for splitting data into training, validation, and test sets. A network should be constantly assessed during training in order to diagnose problems when necessary. For this reason, we showed how to train a network using the Python interface of Caffe and evaluate it constantly using the training-validation curve. We also explained different scenarios that may happen during training, together with their causes and remedies. Then, some of


the successful architectures that have been proposed in the literature for classification of traffic signs were introduced. We implemented and trained these architectures and analyzed their training-validation plots.
Creating an ensemble is a method to increase classification accuracy. We mentioned various methods that can be used for creating an ensemble of models. Then, a method based on optimal subset selection using genetic algorithms was discussed. This way, we create ensembles with the minimum number of models that together increase the classification accuracy.
After that, we showed how to interpret and analyze quantitative results such as precision, recall, and accuracy on a real dataset of traffic signs. We also explained how to understand the behavior of convolutional neural networks using data-driven visualization techniques and nonlinear embedding methods such as t-SNE.
Finally, we finished the chapter by implementing a more accurate and computationally efficient network that has been proposed in the literature. The performance of this network was also analyzed using various metrics and from different perspectives.

5.11 Exercises
5.1 Why might the model not be accurate enough on the test set if the test set and the training set are not drawn from the same distribution?
5.2 When splitting the dataset into training, validation, and test sets, each sample in the dataset is always assigned to one of these sets. Why is it not correct to assign a sample to more than one set?
5.3 Compute the number of multiplications of the network in Fig. 5.9.
5.4 Change the pooling size from 3 × 3 to 2 × 2 and train the networks again. Does that affect the accuracy? Can we generalize the result to other datasets?
5.5 Replace the leaky ReLU with ReLU in the network illustrated in Fig. 5.11 and train the network again. Does it have any impact on the optimization algorithm or the accuracy?
5.6 Change the regularization coefficient to 0.01 and train the network. Explain the results.


References
Aghdam HH, Heravi EJ, Puig D (2015) A unified framework for coarse-to-fine recognition of traffic
signs using bayesian network and visual attributes. In: Proceedings of the 10th international conference on computer vision theory and applications, pp 87–96. doi:10.5220/0005303500870096
Aghdam HH, Heravi EJ, Puig D (2016a) A practical and highly optimized convolutional neural network for classifying traffic signs in real-time. Int J Comput Vis 1–24. doi:10.1007/s11263-016-0955-9
Aghdam HH, Heravi EJ, Puig D (2016b) Analyzing the stability of convolutional neural networks
against image degradation. In: Proceedings of the 11th international conference on computer
vision theory and applications, vol 4(Visigrapp), pp 370–382. doi:10.5220/0005720703700382
Aghdam HH, Heravi EJ, Puig D (2016c) Computer vision ECCV 2016. Workshops 9913:178–191.
doi:10.1007/978-3-319-46604-0
Baró X, Escalera S, Vitrià J, Pujol O, Radeva P (2009) Traffic sign recognition using evolutionary
adaboost detection and forest-ECOC classification. IEEE Trans Intell Transp Syst 10(1):113–126.
doi:10.1109/TITS.2008.2011702
Ciresan D, Meier U, Schmidhuber J (2012a) Multi-column deep neural networks for image classification. In: 2012 IEEE conference on computer vision and pattern recognition, IEEE, pp 3642–
3649. doi:10.1109/CVPR.2012.6248110, arXiv:1202.2745v1
Ciresan D, Meier U, Masci J, Schmidhuber J (2012b) Multi-column deep neural network for traffic
sign classification. Neural Netw 32:333–338. doi:10.1016/j.neunet.2012.02.023
Coates A, Ng AY (2012) Learning feature representations with K-means. Lecture notes in computer
science (Lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 7700.
LECTU:561–580. doi:10.1007/978-3-642-35289-8-30
Fleyeh H, Davami E (2011) Eigen-based traffic sign recognition. IET Intell Transp Syst 5(3):190.
doi:10.1049/iet-its.2010.0159
Gao XW, Podladchikova L, Shaposhnikov D, Hong K, Shevtsova N (2006) Recognition of traffic
signs based on their colour and shape features extracted using human vision models. J V Commun
Image Represent 17(4):675–685. doi:10.1016/j.jvcir.2005.10.003
Greenhalgh J, Mirmehdi M (2012) Real-time detection and recognition of road traffic signs. IEEE
Trans Intell Transp Syst 13(4):1498–1506. doi:10.1109/tits.2012.2208909
Habibi Aghdam H, Jahani Heravi E, Puig D (2016) A practical approach for detection and classification of traffic signs using convolutional neural networks. Robot Auton Syst 84:97–112. doi:10.
1016/j.robot.2016.07.003
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. arXiv:1502.01852
Hinton G (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn
Res (JMLR) 15:1929–1958
Hsu SH, Huang CL (2001) Road sign detection and recognition using matching pursuit method.
Image Vis Comput 19(3):119–129. doi:10.1016/S0262-8856(00)00050-0
Huang Gb, Mao KZ, Siew Ck, Huang Ds (2013) A hierarchical method for traffic sign classification
with support vector machines. In: The 2013 international joint conference on neural networks
(IJCNN), IEEE, pp 1–6. doi:10.1109/IJCNN.2013.6706803
Jin J, Fu K, Zhang C (2014) Traffic sign recognition with hinge loss trained convolutional neural
networks. IEEE Trans Intell Transp Syst 15(5):1991–2000. doi:10.1109/TITS.2014.2308281
Krizhevsky A, Sutskever I, Hinton G (2012) Imagenet classification with deep convolutional neural
networks. In: Advances in neural information processing systems. Curran Associates, Inc., Red
Hook, pp 1097–1105
Larsson F, Felsberg M (2011) Using fourier descriptors and spatial models for traffic sign recognition. In: Image analysis lecture notes in computer science, vol 6688. Springer, Berlin, pp 238–249.
doi:10.1007/978-3-642-21227-7_23


Liu H, Liu Y, Sun F (2014) Traffic sign recognition using group sparse coding. Inf Sci 266:75–89.
doi:10.1016/j.ins.2014.01.010
Lu K, Ding Z, Ge S (2012) Sparse-representation-based graph embedding for traffic sign recognition.
IEEE Trans Intell Transp Syst 13(4):1515–1524. doi:10.1109/TITS.2012.2220965
Maaten LVD, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605.
doi:10.1007/s10479-011-0841-3
Maldonado-Bascon S, Lafuente-Arroyo S, Gil-Jimenez P, Gomez-Moreno H, Lopez-Ferreras F
(2007) Road-sign detection and recognition based on support vector machines. IEEE Trans Intell
Transp Syst 8(2):264–278. doi:10.1109/TITS.2007.895311
Maldonado Bascón S, Acevedo Rodríguez J, Lafuente Arroyo S, Fernández Caballero A, López-Ferreras F (2010) An optimization on pictogram identification for the road-sign recognition task
using SVMs. Comput Vis Image Underst 114(3):373–383. doi:10.1016/j.cviu.2009.12.002
Mathias M, Timofte R, Benenson R, Van Gool L (2013) Traffic sign recognition – how far are we
from the solution? Proc Int Jt Conf Neural Netw. doi:10.1109/IJCNN.2013.6707049
Møgelmose A, Trivedi MM, Moeslund TB (2012) Vision-based traffic sign detection and analysis
for intelligent driver assistance systems: perspectives and survey. IEEE Trans Intell Transp Syst
13(4):1484–1497. doi:10.1109/TITS.2012.2209421
Moiseev B, Konev A, Chigorin A, Konushin A (2013) Evaluation of traffic sign recognition methods
trained on synthetically generated data. In: 15th international conference on advanced concepts
for intelligent vision systems (ACIVS). Springer, Poznań, pp 576–583
Paclík P, Novovičová J, Pudil P, Somol P (2000) Road sign classification using Laplace kernel
classifier. Pattern Recognit Lett 21(13–14):1165–1173. doi:10.1016/S0167-8655(00)00078-7
Piccioli G, De Micheli E, Parodi P, Campani M (1996) Robust method for road sign detection and
recognition. Image Vis Comput 14(3):209–223. doi:10.1016/0262-8856(95)01057-2
Timofte R, Van Gool L (2011) Sparse representation based projections. In: 22nd British machine vision
conference, BMVA Press, pp 61.1–61.12. doi:10.5244/C.25.61
Ruta A, Li Y, Liu X (2010) Robust class similarity measure for traffic sign recognition. IEEE Trans
Intell Transp Syst 11(4):846–855. doi:10.1109/TITS.2010.2051427
Sermanet P, Lecun Y (2011) Traffic sign recognition with multi-scale convoltional networks. In:
Proceedings of the international joint conference on neural networks, pp 2809–2813. doi:10.
1109/IJCNN.2011.6033589
Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y (2013) OverFeat: integrated
recognition, localization and detection using convolutional networks, pp 1–15. arXiv:1312.6229
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: International conference on learning representation (ICLR), pp 1–13. arXiv:1409.1556v5
Stallkamp J, Schlipsing M, Salmen J, Igel C (2012) Man vs. computer: benchmarking machine
learning algorithms for traffic sign recognition. Neural Netw 32:323–332. doi:10.1016/j.neunet.
2012.02.016
Sun ZL, Wang H, Lau WS, Seet G, Wang D (2014) Application of BW-ELM model on traffic sign
recognition. Neurocomputing 128:153–159. doi:10.1016/j.neucom.2012.11.057
Szegedy C, Reed S, Sermanet P, Vanhoucke V, Rabinovich A (2014a) Going deeper with convolutions, pp 1–12. arXiv:1409.4842
Szegedy C, Zaremba W, Sutskever I (2014b) Intriguing properties of neural networks.
arXiv:1312.6199v4
Tibshirani R (1994) Regression selection and shrinkage via the Lasso. doi:10.2307/2346178
Timofte R, Zimmermann K, Van Gool L (2011) Multi-view traffic sign detection, recognition, and
3D localisation. Mach Vis Appl 1–15. doi:10.1007/s00138-011-0391-3
Wang J, Yang J, Yu K, Lv F, Huang T, Gong Y (2010) Locality-constrained linear coding for
image classification. In: 2010 IEEE computer society conference on computer vision and pattern
recognition, IEEE, pp 3360–3367. doi:10.1109/CVPR.2010.5540018


Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable are features in deep neural
networks? Nips’14 vol 27. arXiv:1411.1792v1
Yuan X, Hao X, Chen H, Wei X (2014) Robust traffic sign recognition based on color global and
local oriented edge magnitude patterns. IEEE Trans Intell Transp Syst 15(4):1466–1474. doi:10.
1109/TITS.2014.2298912
Zaklouta F, Stanciulescu B (2012) Real-time traffic-sign recognition using tree classifiers. IEEE
Trans Intell Transp Syst 13(4):1507–1514. doi:10.1109/TITS.2012.2225618
Zaklouta F, Stanciulescu B (2014) Real-time traffic sign recognition in three stages. Robot Auton
Syst 62(1):16–24. doi:10.1016/j.robot.2012.07.019
Zaklouta F, Stanciulescu B, Hamdoun O (2011) Traffic sign classification using K-d trees and
random forests. In: Proceedings of the international joint conference on neural networks, pp
2151–2155. doi:10.1109/IJCNN.2011.6033494
Zeng Y, Xu X, Fang Y, Zhao K (2015) Traffic sign recognition using deep convolutional networks
and extreme learning machine. In: Intelligence science and big data engineering. Image and video
data engineering (IScIDE). Springer, Berlin, pp 272–280

6 Detecting Traffic Signs

6.1 Introduction
Recognizing traffic signs is mainly done in two stages: detection and classification. The detection module performs a multi-scale analysis of the image in order to locate patches containing only one traffic sign. Next, the classification module analyzes each patch individually and classifies it into one of the classes of traffic signs.
The ConvNets explained in the previous chapter are only suitable for the classification module and cannot be directly used for the task of detection. This is due to the fact that applying these ConvNets on high-resolution images is not computationally feasible. On the other hand, the accuracy of the classification module also depends on the detection module. In other words, any false-positive result produced by the detection module will be fed into the classification module and classified as one of the traffic signs. Ideally, the false-positive rate of the detection module must be zero and its true-positive rate must be 1. Achieving this goal usually requires more complex image representations and classification models. However, as the complexity of these models increases, the detection module needs more time to complete its task.
Sermanet et al. (2013) proposed a method for implementing a multi-scale sliding window approach within a ConvNet. Szegedy et al. (2013) formulated the object detection problem as a regression problem over object bounding boxes. Girshick et al. (2014) proposed a method, so-called Regions with ConvNets, in which they apply a ConvNet to bottom-up region proposals for detecting domain-specific objects. Recently, Ouyang et al. (2015) developed a new pooling technique called deformation constrained pooling to model the deformation of object parts with geometric constraints.



6.2 ConvNet for Detecting Traffic Signs
In contrast to offline applications, an ADAS requires algorithms that are able to perform their task in real time. On the other hand, the detection module consumes more time than the classification module, especially when it is applied on a high-resolution image. For this reason, the detection module must be able to locate traffic signs in real time. However, the main barrier to achieving this speed is that the detection module must analyze high-resolution images in order to be able to locate traffic signs that are at a distance of up to 25 m. This is illustrated in Fig. 6.1, in which the width of the image is 1020 pixels.
Assuming that the length of the bus is approximately 12.5 m, we can estimate that the distance between the traffic sign indicated by the arrow and the camera is approximately 20 m. Although the distance is not large, the bounding box of the sign is only about 20 × 20 pixels. Consequently, it is impractical to apply the detection algorithm on low-resolution images since signs that are located 20 m from the camera might not be recognizable. Moreover, considering that the speed of the car is 80 km/h on an interurban road, it will travel 22 m in one second. For this reason, the detection module must be able to analyze more than one frame per second in order to deal with the high-speed motion of the car.
In practice, a car might be equipped with a stereo camera. In that case, the detection
module can be applied much faster since most of the non-traffic sign pixels can be
discarded using the distance information. In addition, the detection module can be
calibrated on a specific car and use the calibration information to ignore non-traffic
sign pixels. In this work, we propose a more general approach by considering that

Fig. 6.1 The detection module must be applied on a high-resolution image


there is only one color camera that can be mounted in front of any car. In other words,
the detection module must analyze all the patches on the image in order to identify
the traffic signs.
We trained separate traffic sign detectors using HOG and LBP features together
with a linear classifier and a random forest classifier. However, previous studies
showed that the detectors based on these features suffer from low precision and recall
values. More importantly, applying these detectors on a high-resolution image using
a CPU is impractical since it takes a long time to process the whole image. Besides, it
is not trivial to implement the whole scanning window approach using these detectors
on a GPU in order to speed up the detection process. For this reason, we developed
a lightweight but accurate ConvNet for detecting traffic signs. Figure 6.2 illustrates
the architecture of the ConvNet.
The above ConvNet is inspired by the Gabor feature extraction algorithm. In this method, a bank of convolution kernels is applied on the image and the output of each kernel is individually aggregated. Then, the final feature is obtained by concatenating the aggregated values. Instead of handcrafted Gabor filters, the proposed ConvNet learns a bank of 60 convolution filters, each with 9 × 9 × 3 coefficients. The output of the first convolution layer is a 12 × 12 × 60 tensor where each slice is a feature map (i.e., 60 feature maps). Then, the aggregation is done by spatially dividing each feature map into four equal regions and finding the maximum response in each region. Finally, the extracted feature vector is nonlinearly transformed into a 300-dimensional space where the ConvNet tries to linearly separate the two classes (traffic sign versus non-traffic sign).
One may argue that we could attach a fully connected network to HOG features (or other handcrafted features) and train an accurate detection model. Nonetheless, there are two important issues with this approach. First, it is not trivial to implement a sliding window detector using these kinds of features on a GPU. Second, as we will

Fig. 6.2 The ConvNet for detecting traffic signs. The blue, green, and yellow colors indicate a convolution, LReLU, and pooling layer, respectively. C(c, n, k) denotes n convolution kernels of size k × k × c and P(k, s) denotes a max-pooling layer with pooling size k × k and stride s. Finally, the number in the LReLU units indicates the leak coefficient of the activation function


Fig. 6.3 Applying the trained ConvNet for hard-negative mining

show in experiments, their representation power is limited and they produce more
false-positive results compared with this ConvNet.
To train the ConvNet, we collect the positive samples and pick some image patches
randomly in each image as the negative samples. After training the ConvNet using
this dataset, it is applied on each image in the training set using the multi-scale sliding
window technique in order to detect traffic signs. Figure 6.3 illustrates the result of
applying the detection ConvNet on an image.
The red, the blue, and the green rectangles show the false-positive, ground-truth,
and the true-positive patches. We observe that the ConvNet is not very accurate and it
produces some false-positive results. Although the false-positive rate can be reduced
by increasing the threshold value of the classification score, the aim is to increase
the overall accuracy of the ConvNet.
There are mainly two solutions to improve the accuracy: either increase the complexity of the ConvNet or refine the current model using more appropriate data. Increasing the complexity of the ConvNet is not practical since it will also increase its time-to-completion. However, it is possible to refine the model using more appropriate data. Here we utilize the hard-negative mining method.
In this method, the current ConvNet is applied on all the training images. Next, all the patches that are classified as positive (the red and the green boxes) are compared with the ground-truth bounding boxes, and those which do not align well are selected as the new negative image patches (the red rectangles). They are called hard-negative patches. Having collected all the hard-negative patches from the training images, the ConvNet is fine-tuned on the new dataset. Mining hard-negative data and fine-tuning the ConvNet can be done repeatedly until the accuracy of the ConvNet converges.
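A sketch of the mining step, under the assumption that detections and ground-truth boxes are given as (x, y, w, h) tuples; the overlap threshold is an illustrative value, not the one used in the book.

def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def hard_negatives(positive_detections, ground_truth, threshold=0.3):
    """Detections classified as positive that do not align with any ground-truth box."""
    return [d for d in positive_detections
            if all(iou(d, g) < threshold for g in ground_truth)]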


6.3 Implementing Sliding Window Within the ConvNet
The detection procedure starts by sliding a 20 × 20 mask over the image and classifying the patch under the mask using the detection ConvNet. After all the pixels are scanned, the image is downsampled and the procedure is repeated on the smaller image. The downsampling can be done several times to ensure that closer objects will also be detected. Applying this simple procedure on a high-resolution image may take several minutes (even on a GPU, because of redundant computation and the transfer of data between the main memory and the GPU). For this reason, we need to find an efficient way of running the above procedure in real time.
Currently, advanced embedded platforms such as the NVIDIA Drive PX (www.nvidia.com/object/drive-px.html) come with a dedicated GPU module. This makes it possible to execute highly computational models in real time on these platforms. Therefore, we consider a similar platform for running the tasks of the ADAS. There are two main computational bottlenecks in a naive implementation of the sliding window detector. On the one hand, the input image patches are very small and they may use only a small fraction of the GPU cores to complete a forward pass in the ConvNet. In other words, two or more image patches could be processed simultaneously depending on the number of GPU cores. However, the aforementioned approach considers two consecutive patches to be independent and applies the convolution kernels on each patch separately. On the other hand, transferring overlapping image patches between the main memory and the GPU is done thousands of times, which adversely affects the time-to-completion. To address these two problems, we propose the following approach for implementing the sliding window method on a GPU. Figure 6.4 shows the intuition behind this implementation.
Normally, the input of the ConvNet is a 20 × 20 × 3 image and the output of the pooling layer is a 2 × 2 × 60 tensor. Also, each neuron in the first fully connected layer is connected to 2 × 2 × 60 neurons in the previous layer. In this work, traffic signs are detected in a 1020 × 600 image. Basically, the sliding window approach scans every pixel in the image to detect the objects (the stride of scanning may be set to two or more for computational purposes). In other words, for each pixel in the image, the first step is to crop a 20 × 20 patch and, then, to apply the bank of convolution filters of the ConvNet on the patch. Next, the same procedure is repeated for the pixel next to the current pixel. Note that 82% of the pixels are common between two consecutive 20 × 20 patches. As a result, transferring the common pixels to the GPU memory is redundant. The solution is to transfer the whole high-resolution image to the GPU memory and apply the convolution kernels on different patches simultaneously.
The next step in the ConvNet is to aggregate the pixels in the output of the convolution layer using the max-pooling layer. When the ConvNet is provided with a 20 × 20 image, the convolution layer generates a 12 × 12 feature map for each kernel. Then, the pooling layer computes the maximum values in 6 × 6 regions. The distance between the regions is 6 pixels (stride = 6). Therefore, the output of the pooling layer



Fig. 6.4 Implementing the sliding window detector within the ConvNet

on a single feature map is a 2 × 2 feature map. Our goal is to implement the sliding window approach within the ConvNet. Consider the two consecutive patches indicated by the red and green rectangles in the convolution layer (Fig. 6.4). The pooling layer will compute the maximum value of 6 × 6 regions. Based on the original ConvNet, the output of the pooling layer for the red rectangle must be computed using the 4 small red rectangles illustrated in the middle figure.
In addition, we also want to aggregate the values inside the green region in the next step. Its corresponding 6 × 6 regions are illustrated using 4 small green rectangles. Since we need to apply the pooling layer consecutively, we must change the stride of the pooling layer to 1. With this formulation, the pooling results of the red and green regions will not be consecutive. Rather, there will be a 6-pixel gap between two consecutive nonoverlapping 6 × 6 regions. The pooling results of the red rectangle are shown using 4 small filled squares in the figure. Recall from the above discussion that each neuron in the first fully connected layer is connected to a 2 × 2 × 60 region in the output of the pooling layer.
Based on the above discussion, we can implement the fully connected layer using
2 × 2 × 60 dilated convolution filters with dilation factor 6. Formally, a W × H
convolution kernel with dilation factor τ is applied using the following equation:
( f (m, n) ∗_τ g) = ∑_{h=−H/2}^{H/2} ∑_{w=−W/2}^{W/2} f (m + τ h, n + τ w) g(h, w).    (6.1)
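As a direct illustration of (6.1), the following sketch computes a dilated (valid-region) convolution with NumPy for an odd-sized kernel; it is written for clarity rather than speed and is not the Caffe implementation used in the book.

import numpy as np

def dilated_conv2d(f, g, tau):
    """Apply kernel g to feature map f with dilation factor tau, as in Eq. (6.1)."""
    H, W = g.shape                         # odd kernel sizes assumed in this sketch
    rh, rw = (H // 2) * tau, (W // 2) * tau
    out = np.zeros((f.shape[0] - 2 * rh, f.shape[1] - 2 * rw), dtype=np.float32)
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            patch = f[m:m + 2 * rh + 1:tau, n:n + 2 * rw + 1:tau]
            out[m, n] = np.sum(patch * g)  # same number of multiplications as tau = 1
    return out

# Example: a 3 x 3 kernel with dilation 6 covers a 13 x 13 neighbourhood.
response = dilated_conv2d(np.random.rand(24, 24), np.random.rand(3, 3), tau=6)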

Note that the number of arithmetic operations in a normal convolution and in its dilated version is identical. In other words, dilated convolution does not change the computational complexity of the ConvNet. Likewise, we can implement the last fully connected layer using 1 × 1 × 2 filters. Using this formulation, we are able to implement the sliding window method purely in terms of convolution layers. The architecture of the sliding window ConvNet is shown in Fig. 6.5.
The output of the fully convolutional ConvNet is y′ ∈ R^{1012×592×2}, where the patch at location (m, n) is a traffic sign if y′(m, n, 0) > y′(m, n, 1). It is a common practice in the sliding window method to process patches every 2 pixels. This is easily implementable by changing the stride of the pooling layer of the fully


Fig. 6.5 Architecture of the sliding window ConvNet

Fig. 6.6 Detection score computed by applying the fully convolutional sliding network to 5 scales
of the high-resolution image

convolutional ConvNet to 2 and adjusting the dilation factor of the convolution kernel in the first fully connected layer accordingly (i.e., to 3). Finally, to implement the multi-scale sliding window, we only need to create different scales of the original image and apply the sliding window ConvNet to them. Figure 6.6 shows the detection score computed by the detection ConvNet on high-resolution images.


Table 6.1 Time-to-completion (milliseconds) of the sliding ConvNet computed on 5 different image scales

Layer            1020 × 600   816 × 480   612 × 360   480 × 240   204 × 120
Data             0.167        0.116       0.093       0.055       0.035
Conv             7.077        4.525       2.772       1.364       0.321
Relu             1.744        1.115       0.621       0.330       0.067
Pooling          6.225        3.877       2.157       1.184       0.244
Fully connected  8.594        5.788       3.041       1.606       0.365
Relu             2.126        1.379       0.746       0.384       0.081
Classify         1.523        0.893       0.525       0.285       0.101
Total            27.656       17.803      10.103      5.365       1.336

Fig. 6.7 Time to completion of the sliding ConvNet for different strides. Left time to completion
per resolution and Right cumulative time to completion

Detecting traffic signs in high-resolution images is the most time-consuming part
of the processing pipeline. For this reason, we executed the sliding ConvNet on a
GeForce GTX 980 card and computed the time to completion of each layer separately.
To be more specific, each ConvNet repeats the forward pass 100 times and the average
time to completion of each layer is computed. The condition of the system is fixed
during all calculations. Table 6.1 shows the results of the sliding ConvNet on 5
different scales. Recall from our previous discussion that the stride of the pooling
layer is set to 2 in the sliding ConvNet.
We observe that applying the sliding ConvNet on 5 different scales of a high-resolution image takes 62.266 ms in total, which is equal to processing 19.13 frames per second. We also computed the time-to-completion of the sliding ConvNet by changing the stride of the pooling layer from 1 to 4. Figure 6.7 illustrates the time to completion per image resolution (left) as well as the cumulative time to completion.
The results reveal that it is not practical to set the stride to 1 since it takes 160 ms to detect traffic signs in an image (6.25 frames per second). In addition, it consumes a considerable amount of GPU memory. However, it is possible to process 19 frames per second by setting the stride to 2. In addition, the reduction in the processing time between stride 3 and stride 4 is negligible. But, stride 3 is preferable compared with


Fig. 6.8 Distribution of
traffic signs in different
scales computed using the
training data

stride 4 since it produces a denser detection score. Last but not least, it is possible to apply a combination of stride 2 and stride 3 at various scales to improve the overall time to completion. For instance, we can set the stride to 3 for the first scale and set it to 2 for the rest of the image scales. By this way, we can save about 10 ms per image. The execution time can be further improved by analyzing the database statistics.
More specifically, traffic signs bounded in 20 × 20 regions will be detected in the first scale. Similarly, signs bounded in 50 × 50 regions will be detected in the 4th scale. Based on this fact, we divided the traffic signs in the training set into 5 groups according to the image scale in which they will be detected. Figure 6.8 illustrates the distribution of traffic signs in each scale.
According to this distribution, we must expect to detect 20 × 20 traffic signs inside a small region in the first scale. That said, only the region between row 267 and row 476 must be analyzed to detect 20 × 20 signs, rather than the whole 600 rows of the first scale. Based on the information depicted in the distribution of signs, we only fetch 945 × 210, 800 × 205, 600 × 180 and 400 × 190 pixel regions of the first 4 scales to the sliding ConvNet. As illustrated by the black line in Fig. 6.7, this reduces the time to completion of the ConvNet with stride 2 to 26.506 ms, which is equal to processing 37.72 high-resolution frames per second.

6.4 Evaluation
The detection ConvNet is trained using mini-batch stochastic gradient descent (batch size = 50) with learning rate annealing. We fix the learning rate to 0.02, the momentum to 0.9, the L2 regularization to 10^-5, the annealing step size to 10^4, the annealing rate to 0.8, the negative slope of the LReLU to 0.01 and the maximum number of iterations to 150,000. The ConvNet is first trained using the ground-truth bounding boxes (the blue boxes) and negative samples collected randomly from the images.
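For illustration, a minimal Caffe solver matching these hyperparameters might look like the following sketch; the network file name is hypothetical, and the mini-batch size of 50 is set in the data layer of the network definition rather than in the solver.

import caffe

solver_str = """net: "detection_net.prototxt"  # hypothetical path to the network definition
base_lr: 0.02          # learning rate
lr_policy: "step"      # step-wise annealing
gamma: 0.8             # annealing rate
stepsize: 10000        # annealing step size
momentum: 0.9
weight_decay: 0.00001  # L2 regularization
max_iter: 150000
solver_mode: GPU
"""

with open('detection_solver.prototxt', 'w') as fs:
    fs.write(solver_str)

solver = caffe.SGDSolver('detection_solver.prototxt')
solver.solve()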


threshold   0.01   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    0.99
precision   94.85  98.78  99.31  99.53  99.65  99.72  99.77  99.84  99.89  99.93  99.98
recall      99.57  98.96  98.61  98.25  97.92  97.62  97.23  96.99  96.49  95.69  92.74

Fig. 6.9 Top: precision-recall curve of the detection ConvNet along with the models obtained by HOG and LBP features. Bottom: numerical values (%) of precision and recall for the detection ConvNet

Then, hard-negative mining is performed on the training set in order to collect harder negative samples, and the ConvNet is trained again. To compare with handcrafted features, we also trained detection models using HOG and LBP features and random forests by following the same procedure. Figure 6.9 illustrates the precision-recall plot of these models.
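A minimal sketch of such a hard-negative mining pass is shown below. It assumes the trained detection network is loaded as a caffe.Net, that its scores are exposed through a blob named classify (the layer name used in Table 6.1) and that index 1 of that blob corresponds to the sign class; these names are assumptions made for illustration.

import numpy as np

def mine_hard_negatives(net, negative_patches, threshold=0.5):
    # Keep the negative patches that the current detector wrongly classifies as
    # traffic signs; they are added to the training set before the second round.
    hard_negatives = []
    for patch in negative_patches:
        net.blobs['data'].data[...] = patch[np.newaxis, ...]
        net.forward()
        sign_score = net.blobs['classify'].data[0, 1]   # assumed blob name and class index
        if sign_score > threshold:
            hard_negatives.append(patch)
    return hard_negatives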
The average precision (AP) of the sliding ConvNet is 99.89%, which indicates a nearly perfect detector. In addition, the average precisions of the models based on HOG and LBP features are 98.39% and 95.37%, respectively. Besides, the precision of the sliding ConvNet is considerably higher than that of the HOG and LBP models. In other words, the number of false-positive samples produced by the sliding ConvNet is lower than for HOG and LBP. It should be noted that false-positive results will be fed directly into the classification ConvNet, where they will be classified into one of the traffic-sign classes. This may produce dangerous situations in the case of autonomous cars. For example, consider a false-positive result produced by the detection module of an autonomous car that is classified as "speed limit 100" in a school zone. Clearly, the autonomous car may increase its speed according to the wrongly detected sign, which may have severe consequences in the real world. Even though the average precisions of the sliding ConvNet and the HOG model are numerically comparable, using the sliding ConvNet is certainly safer and more applicable than the HOG model.
Post-processing bounding boxes: One solution to deal with the false-positive results of the detection ConvNet is to post-process the bounding boxes.


Fig. 6.10 Output of the detection ConvNet before and after post-processing the bounding boxes. A darker bounding box indicates that it is detected in a lower scale image

The idea is that if a bounding box is classified as positive, all the bounding boxes within a {−1, 0, 1} × {−1, 0, 1} neighborhood of it must also be classified as positive. In other words, if a region of the image contains a traffic sign, there must be at least 10 bounding boxes over that region which the detection ConvNet classifies as positive. By applying only this technique, the false-positive rate can be considerably reduced. Figure 6.10 illustrates the results of the detection ConvNet on a few images before and after post-processing the bounding boxes.
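The following sketch illustrates this post-processing on a 2D map of detection scores (one score per sliding-window location within one scale); the grid representation and the names are assumptions made for illustration.

import numpy as np

def suppress_isolated_detections(score_map, threshold=0.5, min_votes=9):
    # score_map: 2D array with one detection score per sliding-window location.
    # A location is kept only if it and all of its 8 neighbours are classified
    # positive (border locations wrap around, which is acceptable for a sketch).
    positive = score_map > threshold
    votes = np.zeros_like(score_map, dtype=np.int32)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            votes += np.roll(np.roll(positive.astype(np.int32), dy, axis=0), dx, axis=1)
    return positive & (votes >= min_votes)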
In general, the detection ConvNet is able to locate traffic signs with high precision
and recall. Furthermore, post-processing the bounding boxes is able to effectively


discard the false-positive results. However, a few false-positive bounding boxes may still exist in the result. In practice, we can add a second verification step by creating a more complex ConvNet and applying it to the results of the detection ConvNet in order to remove the remaining false-positive results.

6.5 Summary
Object detection is one of the hard problems in computer vision. It gets even harder in time-demanding tasks such as ADAS. In this chapter, we explained a convolutional neural network that is able to analyze high-resolution images in real time and accurately find traffic signs. We showed how to quantitatively analyze the network and how to visualize it using an embedding approach.

6.6 Exercises
6.1 Read the documentation of dilation from caffe.proto file and implement the
architecture mentioned in this chapter.
6.2 Tweak the number of filters and the number of neurons in the fully connected layer and train the networks. Is there a more compact architecture that can be used for accurately detecting traffic signs?

References
Girshick R, Donahue J, Darrell T, Berkeley UC, Malik J (2014) Rich feature hierarchies for accurate
object detection and semantic segmentation. doi:10.1109/CVPR.2014.81, arXiv:abs/1311.2524
Ouyang W, Wang X, Zeng X, Qiu S, Luo P, Tian Y, Li H, Yang S, Wang Z, Loy CC, Tang X (2015)
DeepID-Net: deformable deep convolutional neural networks for object detection. In: Computer
vision and pattern recognition. arXiv:1412.5661
Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y (2013) OverFeat: integrated
recognition, localization and detection using convolutional networks, pp 1–15. arXiv:1312.6229
Szegedy C, Toshev A, Erhan D (2013) Deep neural networks for object detection. In: Advances in
neural information processing systems (NIPS). IEEE, pp 2553–2561. http://ieeexplore.ieee.org/
stamp/stamp.jsp?tp=&arnumber=6909673

7

Visualizing Neural Networks

7.1 Introduction
A neural network is a method that transforms input data into a feature space through a
highly nonlinear function. When a neural network is trained to classify input patterns,
it learns a transformation function from input space to the feature space such that
patterns from different classes become linearly separable. Then, it trains a linear
classifier on this feature space in order to classify input patterns. The beauty of
neural networks is that they simultaneously learn a feature transformation function
as well as a linear classifier.
Another approach is to design the feature transformation function by hand and train a linear or nonlinear classifier to differentiate patterns in this space. Histogram of oriented gradients and local binary pattern histograms are two commonly used feature transformation functions. Understanding the underlying process of these functions is more straightforward than understanding a transformation function represented by a neural network.
For example, in the case of histogram of oriented gradients, if there are many strong vertical edges in an image, we know that the bin related to vertical edges is going to be significantly bigger than the other bins in the histogram. If a linear classifier is trained on top of these histograms and the magnitude of the weight of the linear classifier related to the vertical bin is high, we can infer that vertical edges have a great impact on the classification score.
As the above example shows, figuring out how a pattern is classified using a linear classifier trained on top of histogram of oriented gradients is doable. Also, if an interpretable nonlinear classifier such as a decision tree or a random forest is trained on the histogram, it is still possible to explain how a pattern is classified by these methods.


The problem with deep neural networks is that it is hard or even impossible to inspect the weights of the network and understand how the feature transformation function works. In other words, it is not trivial to know how a pattern with many strong vertical edges will be transformed into the feature space. Also, in contrast to histogram of oriented gradients, where each axis of the feature space has an easy-to-understand meaning for humans, the axes of feature spaces represented by a neural network are not easily interpretable.
For these reasons, diagnosing neural networks and understanding their underlying process by direct inspection is not feasible. Visualization is a way to make sense of complex models such as neural networks. In Chap. 5, we showed a few data-oriented techniques for understanding the feature transformation and classification process of neural networks. In this chapter, we will briefly review these techniques again and introduce gradient-based visualization techniques.

7.2 Data-Oriented Techniques
In general, data-oriented visualization methods work by feeding images to a network
and collecting information from desired neurons.

7.2.1 Tracking Activation
In this method, N images are fed into the network and the activation (i.e., the output of the neuron after applying the activation function) of a specific neuron on each of these images is stored in an array. This way, we obtain an array containing N real numbers, where each number is the activation of the neuron on one image. Then, the K ≪ N images with the highest activations are selected (Girshick et al. 2014). This method shows what information about objects in the receptive field of the neuron increases its activation. In Chap. 5, we visualized the classification network trained on the GTSRB dataset using this method.
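A minimal sketch of this procedure is shown below, assuming net is a loaded caffe.Net, the images are already preprocessed, and the layer and channel names are placeholders chosen for illustration.

import numpy as np

def top_activating_images(net, images, layer='conv3', channel=0, k=9):
    # Record the maximum activation of one channel of the given layer for every
    # image, then return the indices of the K images with the highest activation.
    acts = np.zeros(len(images))
    for i, im in enumerate(images):
        net.blobs['data'].data[...] = im[np.newaxis, ...]
        net.forward()
        acts[i] = net.blobs[layer].data[0, channel].max()
    return np.argsort(acts)[::-1][:k]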

7.2.2 Covering Mask
Assume that image x is correctly classified by a neural network with a probability close to 1.0. In order to understand which parts of the image have a greater impact on the score, we can run a multi-scale scanning-window approach. In scale s and at location (m, n) on x, the pixel x(m, n) and all pixels in its neighborhood are set to zero. The size of the neighborhood depends on s. This is equivalent to zeroing the corresponding inputs of the network; in other words, the information in this particular part of the image is missing. If the classification score highly depends on the information centered at (m, n), the score must drop significantly when the pixels in this region are zeroed. If the above procedure is repeated for different scales and at all locations in the image, we


will end up with a map for each scale, where the value of the map will be close to 1 if zeroing its corresponding region does not have any effect on the score. In contrast, the value of the map will be close to zero if zeroing its corresponding region has a great impact on the score. This method was previously used in Chap. 5 on the classification network trained on the GTSRB dataset. One problem with this method is that it can be very time consuming to apply it to many samples of each class in order to figure out which regions are important for the final classification score.
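The sketch below illustrates this scanning procedure for a single scale, assuming a loaded classification network whose scores are stored in a blob named f3_classifier (as in the network definitions later in this chapter); the patch size and stride are arbitrary illustrative values.

import numpy as np

def covering_mask(net, image, class_ind, patch=8, stride=4):
    # image: preprocessed 3 x H x W array. Returns a map whose values are close
    # to 1 where zeroing the region has no effect on the score of class_ind and
    # close to 0 where zeroing the region destroys the score.
    _, H, W = image.shape
    net.blobs['data'].data[...] = image[np.newaxis, ...]
    net.forward()
    base = net.blobs['f3_classifier'].data[0, class_ind]     # assumed output blob name

    rows = range(0, H - patch + 1, stride)
    cols = range(0, W - patch + 1, stride)
    heat = np.zeros((len(rows), len(cols)))
    for i, m in enumerate(rows):
        for j, n in enumerate(cols):
            occluded = image.copy()
            occluded[:, m:m + patch, n:n + patch] = 0
            net.blobs['data'].data[...] = occluded[np.newaxis, ...]
            net.forward()
            heat[i, j] = net.blobs['f3_classifier'].data[0, class_ind] / base
    return heat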

7.2.3 Embedding
Embedding is another technique which provides important information about the feature space. Basically, given a set of feature vectors Z = {Φ(x_1), Φ(x_2), ..., Φ(x_N)}, where Φ : R^{H×W×3} → R^d is the feature transformation function, the goal of embedding is to find a mapping Ψ : R^d → R^d̂ that projects the d-dimensional feature vectors into a d̂-dimensional space. Usually, d̂ is set to 2 or 3, since vectors in these spaces can be easily inspected visually using scatter plots.
There are different methods for finding the mapping Ψ. However, there is a specific mapping which is particularly common for mapping into a two-dimensional space in the field of neural networks. This mapping is called t-distributed stochastic neighbor embedding (t-SNE). It is a structure-preserving mapping, meaning that it tries to keep the neighborhood structure in the d̂-dimensional space as similar as possible to the neighborhood structure in the d-dimensional space. This is an important property, since it shows how separable the patterns of different classes are in the original feature space.
Denoting the feature transformation function up to layer L of a network by Φ_L(x), we collect the set Z_L = {Φ_L(x_1), Φ_L(x_2), ..., Φ_L(x_N)} by feeding many images from different classes to the network and collecting Φ_L(x_i) for each image. Then, the t-SNE algorithm is applied on Z_L in order to find a mapping into the two-dimensional space. The mapped points can be plotted using scatter plots. This technique was used for analyzing networks in Chaps. 5 and 6.
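As an illustration, the following sketch collects the representation of a layer for a set of images and maps it into two dimensions with the t-SNE implementation of scikit-learn; the layer name and the assumption that the images are already preprocessed are made for illustration.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def embed_layer(net, images, labels, layer='fc1'):
    # Collect the representation of every image at the given layer.
    feats = []
    for im in images:
        net.blobs['data'].data[...] = im[np.newaxis, ...]
        net.forward()
        feats.append(net.blobs[layer].data.copy().flatten())
    Z = np.vstack(feats)

    # Map the d-dimensional representations into 2 dimensions with t-SNE and plot them.
    Z2 = TSNE(n_components=2).fit_transform(Z)
    plt.scatter(Z2[:, 0], Z2[:, 1], c=labels, s=8)
    plt.show()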

7.3 Gradient-Based Techniques
Gradient-based methods explain neural networks in terms of their gradient with
respect to the input image x (Simonyan et al. 2013). Depending on how the gradients
are interpreted, a neural network can be studied from different perspectives.1

1 Implementations of the methods in this chapter are available at github.com/pcnn/.


7.3.1 Activation Maximization
Denoting the classification score of x on class c with S_c(x), we can find an input x̂ by maximizing the following objective function:

\[ S_c(\hat{x}) - \lambda \|\hat{x}\|_2^2, \tag{7.1} \]

where λ is a regularization parameter defined by the user. In other words, we are looking for an input image x̂ that maximizes the classification score on class c while staying within the n-sphere defined by the second term of the above function. This loss can be implemented using a Python layer in the Caffe library. Specifically, the layer accepts a parameter indicating the class of interest. Then, it returns the score of the class of interest during the forward pass. In the backward pass, the derivative with respect to all classes except the class of interest is set to zero. Obviously, any change in the inputs of the layer other than the class of interest does not change the output. Consequently, the derivative of the loss with respect to these inputs is equal to zero. In contrast, the derivative of the loss with respect to the class of interest is equal to 1, since the layer just passes the value of the class of interest to the output. One can think of this loss as a multiplexer which directs its inputs according to an address.
The derivative of the second term of the objective function with respect to the classification scores is always zero. However, the derivative of the second term with respect to the input x_i is equal to 2λx_i. In order to formulate the above objective function as a minimization problem, we can simply multiply the function by −1. In that case, the derivative of the first term with respect to the class of interest is equal to −1. Putting all this together, the Python layer for the above loss function can be defined as follows:
import caffe
import numpy as np

class score_loss(caffe.Layer):
    def setup(self, bottom, top):
        # Read the class of interest and the regularization coefficient from param_str.
        params = eval(self.param_str)
        self.class_ind = params['class_ind']
        self.decay_lambda = params['decay_lambda'] if params.has_key('decay_lambda') else 0

    def reshape(self, bottom, top):
        top[0].reshape(bottom[0].data.shape[0], 1)

    def forward(self, bottom, top):
        # The loss is simply the classification score of the class of interest.
        top[0].data[...] = 0
        top[0].data[:, 0] = bottom[0].data[:, self.class_ind]

    def backward(self, top, propagate_down, bottom):
        # Derivative is -1 for the class of interest and 0 for all other classes
        # (the objective is multiplied by -1 to turn it into a minimization).
        bottom[0].diff[...] = np.zeros(bottom[0].data.shape)
        bottom[0].diff[:, self.class_ind] = -1

        # Derivative of the regularization term with respect to the input image.
        if len(bottom) == 2 and self.decay_lambda > 0:
            bottom[1].diff[...] = self.decay_lambda * bottom[1].data[...]

After designing the loss layer, it has to be connected to the trained network. The
following Python script shows how to do this.


def create_net(save_to, class_ind):
    L = caffe.layers
    P = caffe.params
    net = caffe.NetSpec()
    net.data = L.Input(shape=[{'dim': [1, 3, 48, 48]}])
    net.tran = L.Convolution(net.data,
                             num_output=3,
                             group=3,
                             kernel_size=1,
                             weight_filler={'type': 'constant', 'value': 1},
                             bias_filler={'type': 'constant', 'value': 0},
                             param=[{'decay_mult': 1}, {'decay_mult': 0}],
                             propagate_down=True)
    net.conv1, net.act1, net.pool1 = conv_act_pool(net.tran, 7, 100, act='ReLU')
    net.conv2, net.act2, net.pool2 = conv_act_pool(net.pool1, 4, 150, act='ReLU', group=2)
    net.conv3, net.act3, net.pool3 = conv_act_pool(net.pool2, 4, 250, act='ReLU', group=2)
    net.fc1, net.fc_act, net.drop1 = fc_act_drop(net.pool3, 300, act='ReLU')
    net.f3_classifier = fc(net.drop1, 43)
    net.loss = L.Python(net.f3_classifier, net.data, module='py_loss', layer='score_loss',
                        param_str="{'class_ind':%d, 'decay_lambda':5}" % class_ind)
    with open(save_to, 'w') as fs:
        s_proto = 'force_backward:true\n' + str(net.to_proto())
        fs.write(s_proto)
        fs.flush()
        print s_proto


Recall from Chap. 4 that the Python file has to be placed next to the network definition
file. We also set force_backward to true in order to force Caffe to always perform the
backward pass down to the data layer. Finally, the image x̂ can be found by running
the following momentum-based gradient descend algorithm.
caffe.set_mode_gpu()
root = '/home/pc/'
net_name = 'ircv1'
save_to = root + 'cnn_{}.prototxt'.format(net_name)
class_ind = 1
create_net(save_to, class_ind)

# Create the network with the Python loss layer attached and load the trained weights.
net = caffe.Net(save_to, caffe.TEST)
net.copy_from('/home/pc/cnn.caffemodel')

# The mean image is kept for transforming the result back for illustration purposes.
im_mean = read_mean_file('/home/pc/gtsr_mean_48x48.binaryproto')
# The optimization variable is initialized with the (normalized) mean image.
im_res = read_mean_file('/home/pc/gtsr_mean_48x48.binaryproto')
im_res = im_res[np.newaxis, ...] / 255.

alpha = 0.0001
momentum = 0.9
momentum_vec = 0

for i in xrange(4000):
    net.blobs['data'].data[...] = im_res    # im_res already has the batch dimension
    net.forward()
    net.backward()
    momentum_vec = momentum * momentum_vec + alpha * net.blobs['data'].diff
    im_res = im_res - momentum_vec
    im_res = np.clip(im_res, -1, 1)

# Undo the preprocessing and show the obtained image.
fig1 = plt.figure(1, figsize=(6, 6), facecolor='w')
plt.clf()
res = np.transpose(im_res[0].copy() * 255 + im_mean, [1, 2, 0])[:, :, [2, 1, 0]]
res = np.divide(res - res.min(), res.max() - res.min())
plt.imshow(res)


The first part of the script creates the network with the Python loss layer attached and loads the weights of the trained network into memory. Then, the mean image is loaded; this variable is used for applying the backward transformation on the result for illustration purposes. The optimization is initialized by setting the image variable to the mean image, and the learning rate and the momentum of the optimization are configured. The loop then performs the momentum-based gradient descend: at each iteration, the forward pass is executed, the backward pass computes the derivative of the loss function with respect to the input data, and the image is updated. Finally, the commands after the loop show the obtained image. Figure 7.1 illustrates the result of running the above script on each of the classes, separately.
It turns out that the classification score of each class mainly depends on the pictograph inside each sign. Furthermore, the shape of each sign has an impact on the classification score as well. Finally, we observe that the network does a great job of eliminating the background of the traffic sign.
It is worth mentioning that the optimization is done directly on the classification scores rather than on the output of the softmax function. The reason is that maximizing the output of the softmax may not necessarily maximize the score of the class of interest. Instead, it may try to reduce the scores of the other classes.

Fig. 7.1 Visualizing classes of traffic signs by maximizing the classification score on each class.
The top-left image corresponds to class 0. The class labels increase from left to right and top to
bottom


7.3.2 Activation Saliency
Another way of visualizing neural networks is to assess how sensitive the classification score is with respect to every pixel of the input image. This is equivalent to computing the gradient of the classification score with respect to the input image. Formally, given the image x ∈ R^{H×W×3} belonging to class c, we can compute:

\[ \nabla x_{mnk} = \frac{\partial S_c(x)}{\partial x_{mnk}}, \qquad m = 0,\dots,H,\; n = 0,\dots,W,\; k = 0,1,2. \tag{7.2} \]

In this equation, ∇x ∈ R^{H×W×3} stores the gradient of the classification score with respect to every pixel in x. If x is a grayscale image, the output has only one channel and it can be illustrated by mapping each gradient to a color. In the case that x is a color image, the maximum of ∇x is computed across channels:

\[ \nabla x_{mn} = \max_{k=0,1,2} \nabla x_{mnk}. \tag{7.3} \]

Then, ∇xmn is illustrated by mapping each element in this matrix to a color. This
roughly shows saliency of each pixel in x. Figure 7.2 visualizes the class saliency of
a random sample from each class.
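A minimal sketch of this computation is given below. It reuses the network with the score_loss Python layer defined in Sect. 7.3.1 (created with force_backward enabled, class_ind set to the class of the image and, for saliency, decay_lambda set to 0), so that the backward pass places the gradient of S_c with respect to the input in the diff of the data blob.

import numpy as np

def class_saliency(net, image):
    # image: preprocessed 3 x H x W array of the class the network was created for.
    net.blobs['data'].data[...] = image[np.newaxis, ...]
    net.forward()
    net.backward()
    grad = net.blobs['data'].diff[0]      # gradient of the score w.r.t. every pixel
    return np.abs(grad).max(axis=0)       # magnitude, maximum across the three channels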
In general, we see that the pictograph region in each image has a great effect on the classification score. Besides, in a few cases, we also observe that background pixels have an impact on the classification score. However, this might not generalize to all images of the same class. In order to understand the expected saliency of the pixels, we can compute ∇x for many samples of the same class and average them. Figure 7.3 shows the expected class saliency obtained by averaging the class saliency of 100 samples coming from the same class.

Fig. 7.2 Visualizing class saliency using a random sample from each class. The order of images is similar to Fig. 7.1


Fig. 7.3 Visualizing expected class saliency using 100 samples from each class. The order of images is similar to Fig. 7.1

The expected saliency reveals that the classification score mainly depends on the pictograph region. In other words, slight changes in this region may dramatically change the classification score, which in turn may alter the class of the image.

7.4 Inverting Representation
Inverting a neural network (Mahendran and Vedaldi 2015) is a way to roughly know what information is retained by a specific layer of a neural network. Denoting the representation produced by the L-th layer of a ConvNet for the input image x with Φ(x)_L, inverting the ConvNet can be done by minimizing

\[ \hat{x} = \operatorname*{arg\,min}_{x' \in \mathbb{R}^{H \times W \times 3}} \; \| \Phi(x')_L - \Phi(x)_L \|^2 + \lambda \| x' \|_p^p, \tag{7.4} \]

where the first term computes the Euclidean distance between the representations of the source image x and the reconstructed image x', and the second term regularizes the cost by the p-norm of the reconstructed image.
If the regularization term is omitted, it is possible to design a network using the available layers in Caffe which accepts the representation of an image and tries to find the reconstructed image x̂. However, it is not possible to implement the above cost function including the second term using the available layers in Caffe. For this reason, a Python layer has to be implemented which computes the loss and its gradient with respect to its bottoms. This layer can be implemented as follows:


import caffe
import numpy as np

class euc_loss(caffe.Layer):
    def setup(self, bottom, top):
        params = eval(self.param_str)
        self.decay_lambda = params['decay_lambda'] if params.has_key('decay_lambda') else 0
        self.p = params['p'] if params.has_key('p') else 2

    def reshape(self, bottom, top):
        top[0].reshape(bottom[0].data.shape[0], 1)

    def forward(self, bottom, top):
        # Euclidean distance between the current representation (bottom[0])
        # and the representation of the source image (bottom[1]).
        if bottom[0].data.ndim == 4:
            top[0].data[:, 0] = np.sum(np.power(bottom[0].data - bottom[1].data, 2), axis=(1, 2, 3))
        elif bottom[0].data.ndim == 2:
            top[0].data[:, 0] = np.sum(np.power(bottom[0].data - bottom[1].data, 2), axis=1)

        # Regularization term computed on the reconstructed image (bottom[2]).
        if len(bottom) == 3:
            top[0].data[:, 0] += np.sum(np.power(bottom[2].data, 2))

    def backward(self, top, propagate_down, bottom):
        bottom[0].diff[...] = bottom[0].data - bottom[1].data
        if len(bottom) == 3:
            bottom[2].diff[...] = self.decay_lambda * self.p * np.multiply(
                bottom[2].data[...], np.power(np.abs(bottom[2].data[...]), self.p - 2))

Then, the above loss layer is connected to the network trained on the GTSRB dataset.

def create_net_ircv1_vis(save_to):
    L = caffe.layers
    P = caffe.params
    net = caffe.NetSpec()
    net.data = L.Input(shape=[{'dim': [1, 3, 48, 48]}])
    net.rep = L.Input(shape=[{'dim': [1, 250, 6, 6]}])    # output shape of conv3

    net.tran = L.Convolution(net.data,
                             num_output=3,
                             group=3,
                             kernel_size=1,
                             weight_filler={'type': 'constant', 'value': 1},
                             bias_filler={'type': 'constant', 'value': 0},
                             param=[{'decay_mult': 1}, {'decay_mult': 0}],
                             propagate_down=True)
    net.conv1, net.act1, net.pool1 = conv_act_pool(net.tran, 7, 100, act='ReLU')
    net.conv2, net.act2, net.pool2 = conv_act_pool(net.pool1, 4, 150, act='ReLU', group=2)
    net.conv3, net.act3, net.pool3 = conv_act_pool(net.pool2, 4, 250, act='ReLU', group=2)
    net.fc1, net.fc_act, net.drop1 = fc_act_drop(net.pool3, 300, act='ReLU')
    net.f3_classifier = fc(net.drop1, 43)
    net.loss = L.Python(net.act3, net.rep, net.data, module='py_loss', layer='euc_loss',
                        param_str="{'decay_lambda':10,'p':6}")

The network accepts two inputs. The first input is the reconstructed image and the second input is the representation of the source image. In the above network, our goal is to reconstruct the image using the representation produced by the activation of the third convolution layer. Following the shape of this activation, the second input of the network is set to 1 × 250 × 6 × 6. Moreover, as proposed in Mahendran and Vedaldi (2015), we set the value of p


in the above network to 6. Having the network created, we can execute the following
momentum-based gradient descend for finding x̂.
im_mean = read_mean_file('/home/pc/gtsr_mean_48x48.binaryproto')
im_mean = np.transpose(im_mean, [1, 2, 0])

# Load and preprocess the source image.
im = cv2.imread('/home/pc/GTSRB/Training_CNN/00016/crop_00001_00029.ppm')
im = cv2.resize(im, (48, 48))
im_net = (im.astype('float32') - im_mean) / 255.
net.blobs['data'].data[...] = np.transpose(im_net, [2, 0, 1])[np.newaxis, ...]

# Store the representation of the source image (activation of the third convolution layer).
net.forward()
rep = net.blobs['act3'].data.copy()

# The reconstruction starts from a black (all-zero) image.
im_res = np.transpose(im * 0, [2, 0, 1])[np.newaxis, ...].astype('float32')

alpha = 0.000001
momentum = 0.9
momentum_vec = 0

for i in xrange(10000):
    net.blobs['data'].data[...] = im_res
    net.blobs['rep'].data[...] = rep[...]

    net.forward()
    net.backward()

    momentum_vec = momentum * momentum_vec - alpha * net.blobs['data'].diff

    im_res = im_res + momentum_vec
    im_res = np.clip(im_res, -1, 1)

# Undo the preprocessing and show the reconstructed image.
plt.figure(1)
plt.clf()
res = np.transpose(im_res[0].copy(), [1, 2, 0])
res = np.clip(res * 255 + im_mean, 0, 255)
res = np.divide(res - res.min(), res.max() - res.min())
plt.imshow(res[:, :, [2, 1, 0]])
plt.show()
38

In the above code, the source image is first fed to the network and the output of the third convolution layer is copied into memory. Then, the optimization is done in 10,000 iterations. At each iteration, the reconstructed image is fed to the network and the backward pass is computed down to the input layer. This way, the gradient of the loss function with respect to the input is obtained. Finally, the reconstructed image is updated using the momentum gradient descend rule. Figure 7.4 shows the result of inverting the classification network from different layers. We see that the first convolution layer keeps photo-realistic information. For this reason, the reconstructed image is very similar to the source image. Starting from the second convolution layer, photo-realistic information starts to vanish and it is replaced with the parts of the image that are important to that layer. For example, the fully connected layer mainly depends on a specific part of the pictograph on the sign and it ignores background information.


Fig. 7.4 Reconstructing a traffic sign using representation of different layers

7.5 Summary
Understanding the behavior of neural networks is necessary in order to better analyze and diagnose them. Quantitative metrics such as classification accuracy and the F1 score just give us numbers indicating how good the classifier is on our problem. They do not tell us how a neural network achieves this result. Visualization is a set of techniques that are commonly used for understanding the structure of high-dimensional vectors.
In this chapter, we briefly reviewed data-driven techniques for visualization and showed how to apply them to neural networks. Then, we focused on techniques that visualize neural networks by minimizing an objective function. Among them, we explained three different methods.
In the first method, we defined a loss function and found an image that maximizes the classification score of a particular class. In order to generate more interpretable images, the objective function was regularized using the L2 norm of the image. In the second method, the gradient of a particular neuron was computed with respect to the input image and it was illustrated by computing its magnitude.
The third method formulated the visualization problem as an image reconstruction problem. To be more specific, we explained a method that tries to find an image whose representation is very close to the representation of the original image. This technique tells us what information is usually discarded by a particular layer.

7.6 Exercises
7.1 Visualizing a ConvNet can be done by maximizing the softmax score of a specific class. However, this may not exactly generate an image that maximizes the classification score. Explain the reason, taking into account the definition of the softmax score.
7.2 Try to embed features extracted by the neural network using the locally linear embedding method.
7.3 Use Isomap to embed the features into a two-dimensional space.


7.4 Assume an image of a traffic sign belonging to class c that is correctly classified by the ConvNet. Instead of maximizing S_c(x), try to directly minimize S_c(x) such that x is no longer classified correctly by the ConvNet but is still easily recognizable for humans.

References
Girshick R, Donahue J, Darrell T, Berkeley UC, Malik J (2014) Rich feature hierarchies for accurate
object detection and semantic segmentation. doi:10.1109/CVPR.2014.81, arXiv:abs/1311.2524
Mahendran A, Vedaldi A (2015) Understanding deep image representations by inverting them.
In: Computer vision and pattern recognition. IEEE, Boston, pp 5188–5196. doi:10.1109/CVPR.
2015.7299155, arXiv:abs/1412.0035
Simonyan K, Vedaldi A, Zisserman A (2013) Deep inside convolutional networks: visualising image classification models and saliency maps, pp 1–8. arXiv:1312.6034

Appendix A
Gradient Descend

Any classification model such as a neural network is trained using an objective function. The goal of an objective function is to compute a scalar based on the training data and the current configuration of the parameters of the model. This scalar shows how good the model is at classifying the training samples. Assume that the range of the objective function is the interval [0, ∞), where it returns 0 for a model that classifies the training samples perfectly. As the error of the model increases, the objective function returns a larger positive number.
Let Φ(x; θ) denote a model which classifies the sample x ∈ R^d. The model is defined by its parameter vector θ ∈ R^q. Based on that, a training algorithm aims to find θ̂ such that the objective function returns a number close to zero given the model Φ(.; θ̂). In other words, we are looking for a parameter vector θ̂ that minimizes the objective function.
Depending on the objective function, there are different ways to find θ̂. Assume that the objective function is differentiable everywhere. The closed-form solution for finding the minimum of the objective function is to set its derivative to zero and solve the resulting equation. However, in practice, finding a closed-form solution for this equation is usually impossible. The objective function that we use for training a classifier is multivariate, which means that we have to set its gradient to zero in order to find the minimum. In a neural network with 1 million parameters, the gradient of the objective function is a 1-million-dimensional vector, and finding a closed-form solution for this equation is almost impossible.
For this reason, we always use a numerical method for finding a (local) minimum of the objective function. Like many numerical methods, this is an iterative process. The general procedure is given by Algorithm 1: the algorithm always starts from an initial point and then iteratively updates the current solution using a vector δ. The only unknown in this algorithm is the vector δ.



Algorithm 1 Numerical optimization
x′ ← random vector
while stopping condition not reached do
    x′ = x′ + δ
return x′

A randomized hill climbing algorithm sets δ to a random vector.1 However, it is not guaranteed that the objective function will be constantly minimized this way. Hence, its convergence might be slow, especially when the dimensionality of the parameter vector θ is high. For this reason, we need a better heuristic for finding δ.
Assume that you are in the middle of a hill and you want to go down as quickly as possible. There might be many paths that you can choose as your next step. Some of them will get you closer to the bottom of the hill, but there is only one move that will get you much closer than all the other moves: the move exactly along the steepest descend. In the above algorithm, the hill is the objective function and your current location on the hill is analogous to x′. The steepest descend is related to δ.
From a mathematical perspective, the steepest descend is given by the gradient of the function at the current location. Gradient descend is an iterative optimization algorithm for finding a local minimum of a function. It sets δ proportional to the negative of the gradient. The following pseudocode shows the gradient descend algorithm:
Algorithm 2 Gradient descend algorithm
x′ ← random vector
while stopping condition not reached do
    x′ = x′ − α∇J(θ; x)
return x′

In this algorithm, α is the learning rate and ∇ denotes the gradient of the objective function J(.) with respect to θ. The learning rate α shows how big the next step in the direction of steepest descend should be. In addition, the stopping condition might be implemented by assuming a maximum number of iterations. Also, the loop can be stopped if the changes of θ are less than a threshold. Let us explain this using an example. Assume that our model is defined as follows:

\[ \Phi(x;\theta) = \theta_1 x_1 + \theta_2 x_2, \tag{A.1} \]

1 A randomized hill climbing algorithm accepts x′ + δ if it reduces the objective function. Otherwise, it rejects the current δ and generates a new δ. The above process is repeated until the stopping criteria are reached.


where x, θ ∈ R2 are two-dimensional vectors. We are given a dataset X =
{x 1 , x 2 , . . . , x n } and our goal is to minimize the following objective function:
\[ J(\theta^t) = \frac{1}{2n} \sum_{i=1}^{n} \left( \Phi(x^i; \theta^t) \right)^2 \tag{A.2} \]

In order to minimize J(.) using the gradient descend algorithm, we need to compute its gradient vector, which is given by

\[ \nabla J(\theta^t) = \left[ \frac{\partial J(\theta^t)}{\partial \theta_1}, \frac{\partial J(\theta^t)}{\partial \theta_2} \right] = \left[ \frac{1}{n} \sum_{i=1}^{n} x_1^i (\theta_1 x_1^i + \theta_2 x_2^i),\; \frac{1}{n} \sum_{i=1}^{n} x_2^i (\theta_1 x_1^i + \theta_2 x_2^i) \right]. \tag{A.3} \]

Since J (.) is a two-dimensional function it can be illustrated using filled contour
plots. To be more specific, the dataset X is fixed and the variables of this function
are θ1 and θ2 . Therefore, we can evaluate J for different values of θ and show it
using a contour plot. The following Python script plots this function:
import numpy as np
import matplotlib.pyplot as plt

def J(x, w):
    e = (np.dot(x, w.transpose())) ** 2
    return np.mean(e, axis=0)

def dJ(x, w):
    return np.mean(x * (x * w.transpose()), axis=0)

# Fixed dataset X and a grid of parameter values for plotting J.
x1, x2 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100), indexing='ij')
x1x2 = np.stack((x1.flatten(), x2.flatten()), axis=1)

w1, w2 = np.meshgrid(np.linspace(-0.9, 0.9, 50), np.linspace(-0.9, 0.9, 50), indexing='ij')
w = np.stack((w1.flatten(), w2.flatten()), axis=1)

e = J(x1x2, w)

plt.figure(1, figsize=(9, 8), facecolor='w')
plt.contourf(w1, w2, np.reshape(e, w1.shape), 50)
plt.colorbar()
plt.show()

If we execute the above code, we will obtain the result illustrated in Fig. A.1. Since we know ∇J(θ^t), we can plug it into the gradient descend algorithm and find the minimum of J by constantly updating θ^t until the algorithm converges. The following Python script shows how to do this:
w_sol = np.asarray([0.55, 0.50])
alpha = 0.02                       # learning rate (assumed; not shown in the original listing)
batch_size = x1x2.shape[0]         # batch gradient descend uses all the samples
for _ in xrange(50):
    x = x1x2

    e = J(x, w)
    de = dJ(x, w_sol)

    w_sol = w_sol - alpha * de


Fig. A.1 Surface of the error
function in (A.2)

The algorithm starts by initializing θ to [0.55, 0.50]. Then, it executes the gradient calculation and the parameter update for 50 iterations. At each iteration, the variable w_sol slightly changes and moves toward the minimum of J. Figure A.2 shows the trajectory of the parameters in 50 iterations. The function is steeper at the initial location. Hence, the parameters are updated using bigger steps. However, as the parameters approach the minimum of the function, the gradient becomes smaller. For this reason, the parameters are updated using smaller steps. Assume we change Φ(x; θ) to
Fig. A.2 Trajectory of
parameters obtained using
the gradient descend
algorithm


Fig. A.3 Surface of J (.)
using Φ in (A.4)

\[ \Phi(x;\theta) = \theta_1 x_1 + \theta_2 x_2 - \theta_1^3 x_1 - \theta_2^3 x_2. \tag{A.4} \]

Figure A.3 illustrates the surface of J(.) using the above definition of Φ(.). In contrast to the previous definition of Φ, the surface of J(.) with the new definition is multi-modal. In other words, the surface is not convex anymore.
An immediate conclusion from a non-convex function is that there is more than one local minimum in the function. Consequently, depending on the initial location on the surface of J(.), the trajectory of the algorithm might be different. This property is illustrated in Fig. A.4.
As is clear from the figure, although the initial solutions are very close to each other, their trajectories are completely different and they have converged to distinct local minimums. Sensitivity to the initial solution is an inevitable issue for the gradient descend algorithm. For a linear classifier, J(.) is a convex function of the parameter vector θ. However, for models such as multilayer feed-forward neural networks, J(.) is a non-convex function. Therefore, depending on the initial value of θ, the gradient descend algorithm is likely to converge to different local minimums.
Regardless of the definition of Φ, the gradient descend algorithm applied to the above definition of J(.) is called vanilla gradient descend or batch gradient descend. In general, the objective function J can be defined in three different ways:

\[ J(\theta) = \frac{1}{n} \sum_{i=1}^{n} L\big( \Phi(x^i; \theta) \big) \tag{A.5} \]

\[ J(\theta) = L\big( \Phi(x^m; \theta) \big), \qquad m \in \{1, \dots, n\} \tag{A.6} \]

\[ J(\theta) = \frac{1}{k} \sum_{i=m}^{m+k} L\big( \Phi(x^i; \theta) \big), \qquad k \ll n,\; m \in \{1, \dots, n-k\} \tag{A.7} \]

In the above equations, L is a loss function which computes the loss of Φ given the vector x^i. We explained different loss functions that can be used for classification in Chap. 2. The only difference between the above definitions is the number of terms in the summation. The definition in (A.5) sums the loss over all n samples in the training set. This is why it is called batch gradient descend. As a result, ∂J/∂θ_j is also computed over all the samples in the training set. Assume we want to train a neural network with 10M parameters on a dataset containing 20M samples. Suppose that computing the gradient on one sample takes 0.002 s. This means that it will take 20M × 0.002 = 40,000 s (about 11 h) to compute (A.5) and do a single update of the parameters. The parameters of a neural network may require thousands of updates before converging to a local minimum, which is impractical to do using (A.5).
The formulation of J in (A.6) is called stochastic gradient descend; it computes the gradient on only one sample and updates the parameter vector θ using the gradient over that single sample. Using this formulation, it is possible to update the parameters thousands of times in a tractable period. The biggest issue with this formulation is the fact that a single sample may not represent the error surface with an acceptable precision. Let us explain this with an example. Assume the formulation of Φ in (A.4). We previously showed the surface of the error function (A.5) in Fig. A.3. Now, we


Fig. A.5 Contour plot of (A.6) computed using three different samples

compute the surface of (A.6) using only three samples of the training set rather than all samples. Figure A.5 illustrates the contour plots associated with each sample.
As we expected, a single sample is not able to accurately represent the error surface. As a result, ∂J/∂θ_j might be different if it is computed on two different samples. Therefore, the magnitude and direction of the parameter update will highly depend on the sample at the current iteration. For this reason, we expect the trajectory of the parameter updates to be jittery. Figure A.6 shows the trajectory of the stochastic gradient descend algorithm.
Compared with the trajectory of the vanilla gradient descend, the trajectory of the stochastic gradient descend is jittery. From a statistical point of view, if we take into account the gradients of J along its trajectory, the gradient vector of the stochastic gradient descend method has a higher variance compared with vanilla gradient descend. In highly nonlinear functions such as neural networks, unless the learning

Fig. A.6 Trajectory of
stochastic gradient descend


Fig. A.7 Contour plot of the mini-batch gradient descend function for mini-batches of size 2 (left),
10 (middle) and 40 (right)

rate is adjusted carefully, this causes the algorithm to jump over local minimums several times and it may take a longer time for the algorithm to converge. Adjusting the learning rate in stochastic gradient descend is not trivial, and for this reason stochastic gradient descend is rarely used for training neural networks. On the other hand, minimizing the objective using vanilla gradient descend is not tractable either.
The trade-off between vanilla gradient descend and stochastic gradient descend is (A.7), which is called mini-batch gradient descend. In this method, the objective function is computed over a small batch of samples. The size of the batch is much smaller than the number of samples in the training set. For example, k in this equation can be set to 64, giving a batch of 64 samples. We computed the error surface for mini-batches of size 2, 10, and 40 in our example. Figure A.7 shows the results.
We observe that a small mini-batch is not able to adequately represent the error surface. However, the error surfaces represented by larger mini-batches are more accurate. For this reason, we expect the trajectory of mini-batch gradient descend to become smoother as the size of the mini-batch increases. Figure A.8 shows the trajectory of the mini-batch gradient descend method for different batch sizes.
Depending on the error surface, the accuracy of the error surface may not improve significantly after a certain mini-batch size. In other words, using a mini-batch of size 50 may produce the same result as a mini-batch of size 200. However, the former size

Fig. A.8 Trajectory of the mini-batch gradient descend function for mini-batches of size 2 (left),
10 (middle), and 40 (right)


is preferable since it converges faster. Currently, complex models such as neural networks are trained using mini-batch gradient descend. From a statistical point of view, the variance of the gradient vector in mini-batch gradient descend is lower than in stochastic gradient descend but it might be higher than in the batch gradient descend algorithm. The following Python script shows how to implement the mini-batch gradient descend algorithm in our example.
import numpy as np
from random import seed, shuffle

def J(x, w):
    e = (np.dot(x, w.transpose()) - np.dot(x, w.transpose() ** 3)) ** 2
    return np.mean(e, axis=0)

def dJ(x, w):
    return np.mean((x - 3 * x * w.transpose() ** 2) *
                   ((x * w.transpose()) - (x * w.transpose() ** 3)), axis=0)

x1, x2 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100), indexing='ij')
x1x2 = np.stack((x1.flatten(), x2.flatten()), axis=1)

w1, w2 = np.meshgrid(np.linspace(-0.9, 0.9, 50), np.linspace(-0.9, 0.9, 50), indexing='ij')
w = np.stack((w1.flatten(), w2.flatten()), axis=1)

# Shuffle the order of the samples once before iterating over mini-batches.
seed(1234)
ind = range(x1x2.shape[0])
shuffle(ind)

w_sol = np.asarray([0.55, 0.50])

alpha = 0.02
batch_size = 40

start_ind = 0

for _ in xrange(50):
    # Take the next mini-batch of samples.
    end_ind = min(x1x2.shape[0], start_ind + batch_size)
    x = x1x2[ind[start_ind:end_ind], :]

    if end_ind >= x1x2.shape[0]:
        start_ind = 0
    else:
        start_ind += batch_size

    de = dJ(x, w_sol)
    w_sol = w_sol - alpha * de

A.1 Momentum Gradient Descend
There are some variants of the gradient descend algorithm that improve its convergence speed. Among them, momentum gradient descend is commonly used for training convolutional neural networks. The example that we have used so far has a nice property: all elements of the input x have the same scale. In practice, however, we usually deal with high-dimensional input vectors whose elements may not have the same scale. In this case, the error surface is a ravine, where the surface is steeper in one direction than in the others. Figure A.9 shows a ravine surface and the trajectory of mini-batch gradient descend on this surface.


Fig. A.9 A ravine error surface and trajectory of mini-batch gradient descend on this surface

The algorithm oscillates many times until it converges to the local minimum. The reason is that the learning rate is high, so the solution jumps over the local minimum after an update whenever the gradient varies significantly. In order to reduce the oscillation, the learning rate can be reduced. However, it is not easy to decide when to reduce the learning rate. If we set the learning rate to a very small value from the beginning, the algorithm may not converge in an acceptable time period. If we set it to a high value, it may oscillate a lot on the error surface.
Momentum gradient descend is a method to partially address this problem. It keeps a history of the gradient vector from previous steps and updates the parameter vector θ based on the gradient of J with respect to the current mini-batch and its history on previous mini-batches. Formally,

\[ \nu^t = \gamma \nu^{t-1} - \alpha \nabla J(\theta^t) \]
\[ \theta^{t+1} = \theta^t + \nu^t. \tag{A.8} \]

Obviously, the vector ν has the same dimension as α∇J(θ^t). It is always initialized with zero. The hyperparameter γ ∈ [0, 1) is a value between 0 and 1 (1 not included). It has to be smaller than one in order to make it possible for the algorithm to eventually forget the gradient. Sometimes the subtraction and addition operators are switched in these two equations, but switching the operators does not have any effect on the output. Figure A.10 shows the trajectory of the mini-batch gradient descend with γ = 0.5.
We see that the trajectory oscillates much less using the momentum. The momentum parameter γ is commonly set to 0.9, but smaller values can also be assigned to this hyperparameter. The following Python script shows how to create the ravine surface and implement momentum gradient descend. In the following script, the size of the mini-batch is set to 2, but you can try with larger mini-batches as well.


Fig. A.10 Trajectory of
momentum gradient descend
on a ravine surface

import numpy as np
from random import seed, shuffle

def J(x, w):
    e = (np.dot(x, w.transpose())) ** 2
    return np.mean(e, axis=0)

def dJ(x, w):
    return np.mean(x * (x * w.transpose()), axis=0)

# The second dimension of the input is sampled far from the origin so that the
# error surface becomes a ravine (much steeper along one axis than the other).
x1, x2 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-20, -15, 100), indexing='ij')
x1x2 = np.stack((x1.flatten(), x2.flatten()), axis=1)

w1, w2 = np.meshgrid(np.linspace(-0.9, 0.9, 50), np.linspace(-0.9, 0.9, 50), indexing='ij')
w = np.stack((w1.flatten(), w2.flatten()), axis=1)

seed(1234)
ind = range(x1x2.shape[0])
shuffle(ind)

w_sol = np.asarray([-0.55, 0.50])

alpha = 0.0064
batch_size = 2

start_ind = 0

momentum = 0.5
momentum_vec = 0
for _ in xrange(50):
    end_ind = min(x1x2.shape[0], start_ind + batch_size)
    x = x1x2[ind[start_ind:end_ind], :]

    if end_ind >= x1x2.shape[0]:
        start_ind = 0
    else:
        start_ind += batch_size

    de = dJ(x, w_sol)
    momentum_vec = momentum_vec * momentum + alpha * de
    w_sol = w_sol - momentum_vec


Fig. A.11 Problem of
momentum gradient descend

A.2 Nesterov Accelerated Gradients
One issue with momentum gradient descend is that when the algorithm is on the path of steepest descend, the gradients are accumulated and the momentum vector may become bigger and bigger. It is like rolling a snowball down a hill, where it becomes bigger and bigger. When the algorithm gets closer to the local minimum, it will jump over it since the momentum has become very large. This is where the algorithm takes a longer trajectory to reach the local minimum. This issue is illustrated in Fig. A.11.
The above problem happens because the momentum gradient descend accumulates gradients blindly. It does not take into account what may happen in the next steps. It realizes its mistake exactly in the next step and tries to correct it after making the mistake. Nesterov accelerated gradient alleviates this problem by computing the gradient of J with respect to the parameters of the next step. To be more specific, θ^t + γν^{t−1} approximately tells us where the next step is going to be. Based on this idea, the Nesterov accelerated gradient update rule is defined as

\[ \nu^t = \gamma \nu^{t-1} - \alpha \nabla J(\theta^t + \gamma \nu^{t-1}) \]
\[ \theta^{t+1} = \theta^t + \nu^t. \tag{A.9} \]

By changing the update rule of vanilla momentum gradient descend to the Nesterov accelerated gradient, the algorithm has an idea about the next step and it corrects its mistakes before they happen. Figure A.12 shows the trajectory of the algorithm using this method.


Fig. A.12 Nesterov
accelerated gradient tries to
correct the mistake by
looking at the gradient in the
next step

We see that the trajectory of Nesterov gradient descend is shorter than that of momentum gradient descend, but it still has the same problem. Implementing the Nesterov accelerated gradient is simple. We only need to replace the gradient and update statements at the end of the previous script with the following statements:
de_nes = dJ(x, w_sol - momentum_vec * momentum)   # gradient at the look-ahead point
momentum_vec = momentum_vec * momentum + alpha * de_nes
w_sol = w_sol - momentum_vec

A.3 Adaptive Gradients (Adagrad)
The learning rate α is constant for all elements of ∇J. One of the problems with objective functions that have ravine surfaces is that the learning rates of all elements are equal. However, the elements analogous to steep dimensions have higher gradient magnitudes and the elements analogous to gentle dimensions have smaller magnitudes. When they are all updated with the same learning rate, the algorithm makes a larger step in the direction of the steep elements. For this reason, it oscillates on the error surface.
Adagrad is a method to adaptively assign a learning rate to each element of the gradient vector based on the magnitude of the gradients of that element in the past. Let ω_l denote the sum of squares of the gradients along the l-th dimension of the gradient vector. Adagrad updates the parameter vector θ as

\[ \theta_l^{t+1} = \theta_l^t - \frac{\alpha}{\sqrt{\omega_l + \epsilon}} \, \frac{\partial J(\theta^t)}{\partial \theta_l^t}. \tag{A.10} \]


Fig. A.13 Trajectory of
Adagrad algorithm on a
ravine error surface

In this equation, θ_l denotes the l-th element of the parameter vector. We can replace the update statements at the end of the previous script with the following statements, where adagrad_vec is initialized to zero and eps is a small constant that avoids division by zero:
de_adagrad = dJ(x, w_sol)
adagrad_vec = adagrad_vec + de_adagrad ** 2      # accumulate squared gradients over time
w_sol = w_sol - (alpha / np.sqrt(adagrad_vec + eps)) * de_adagrad

The result of optimizing an objective function with a ravine surface is illustrated in Fig. A.13. In contrast to the other methods, Adagrad generates a short trajectory toward the local minimum.
The main restriction of the Adagrad algorithm is that the learning rate may drop rapidly after a few iterations. This makes it very difficult or even impossible for the algorithm to reach a local minimum in an acceptable time. This is due to the fact that the magnitudes of the gradients are accumulated over time. Since the magnitude is obtained by computing the square of the gradients, the value of ω_l always increases at each iteration. As a result, the learning rate of each element gets smaller and smaller at each iteration, since ω_l appears in the denominator. After a certain number of iterations, the adaptive learning rate might be very small and for this reason the parameter updates will be negligible.

A.4 Root Mean Square Propagation (RMSProp)
Similar to Adagrad, root mean square propagation, which is commonly known as RMSProp, is a method for adaptively changing the learning rate of each element of the gradient vector. However, in contrast to Adagrad, where the magnitudes of the gradients are always accumulated over time, RMSProp has a forget rate by which the accumulated magnitudes of the gradients are forgotten over time. For this reason, the quantity ω_l is not always ascending; it may sometimes descend, depending on the current gradients and the forget rate. Formally, the RMSProp algorithm updates the parameters as follows:

\[ \omega_l^t = \gamma \omega_l^{t-1} + (1-\gamma) \left( \frac{\partial J(\theta^t)}{\partial \theta_l^t} \right)^2 \]
\[ \theta_l^{t+1} = \theta_l^t - \frac{\alpha}{\sqrt{\omega_l^t + \epsilon}} \, \frac{\partial J(\theta^t)}{\partial \theta_l^t}. \tag{A.11} \]

In this equation, γ ∈ [0, 1) is the forget rate and it is usually set to 0.9. This can be implemented simply by replacing the update statements at the end of the previous script with the following statements, where rmsprop_vec is initialized to zero and rmsprop_gamma is the forget rate:
de_rmsprop = dJ(x, w_sol)
rmsprop_vec = rmsprop_vec * rmsprop_gamma + (1 - rmsprop_gamma) * de_rmsprop ** 2
w_sol = w_sol - (alpha / np.sqrt(rmsprop_vec)) * de_rmsprop

Figure A.14 shows the trajectory of the RMSProp algorithm on a ravine error surface as well as on a nonlinear error surface. We see that the algorithm makes baby steps but it has a short trajectory toward the local minimums.
In practice, most convolutional neural networks are trained using the mini-batch gradient descend algorithm with momentum, but the other algorithms that we mentioned in this section can also be used for training a neural network.

Fig. A.14 Trajectory of RMSProp on a ravine surface (left) and a nonlinear error surface (right)
using mini-batch gradient descend


A.5 Shuffling
The gradient descend algorithm usually iterates over all samples several times before
converging to a local minimum. One epoch refers to running the gradient descend
algorithm over the whole set of samples exactly once. Recall that the error surface is
always approximated using one sample (stochastic gradient descend) or a few samples
(mini-batch gradient descend). Consider the i th and (i + 1) th mini-batches. If the samples
in these two mini-batches have not changed compared with the previous epoch,
the error surface approximated by the i th mini-batch in the previous epoch is
identical to the one in the current epoch. The samples in one mini-batch might not be properly
distributed in the input space and they may approximate the error surface poorly. Hence,
the gradient descend algorithm may take a longer time to converge, or it may not even
converge in a tractable time.
Shuffling is a technique that reorders all training samples at the end of each epoch.
This way, the error surface approximated by the i th mini-batch will be different in two
consecutive epochs. In most cases, this improves the result of the gradient descend algorithm.
As suggested in Bengio (2012), shuffling may increase the convergence
speed.
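A minimal sketch of how shuffling is typically placed inside a mini-batch training loop is given below. The arrays X and y, the batch size, and the train_step function are hypothetical placeholders and do not come from the scripts in the earlier chapters.

import numpy as np

def train(X, y, train_step, batch_size=64, epochs=10):
    n = X.shape[0]
    for epoch in range(epochs):
        order = np.random.permutation(n)      # reshuffle the sample order every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            train_step(X[idx], y[idx])        # the i-th mini-batch now differs between epochs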

Glossary

Activation function An artificial neuron applies a linear transformation to its
input. In order to make the transformation nonlinear, a nonlinear function is
applied to the output of the neuron. This nonlinear function is called the activation
function.
Adagrad Adagrad is a method used in the gradient descend algorithm to adaptively assign a distinct learning rate to each element of the parameter vector. This is
different from the original gradient descend where all elements share the same learning rate. Adagrad computes a learning rate for each element by dividing a base
learning rate by the square root of the sum of squared gradients of that element.
Backpropagation Computing the gradient of complex functions such as neural networks directly with the multivariate chain rule is not tractable. Backpropagation is an algorithm for efficiently computing the gradient of a function with respect to its parameters
using only one backward pass from the last node of the computational graph to the first
node of the graph.
Batch gradient descend Vanilla gradient descend, which is also called batch gradient descend, is a gradient-based method that computes the gradient of the loss function using the whole set of training samples. A main disadvantage of this method is that it
is not computationally efficient on training sets with many samples.
Caffe Caffe is a deep learning library written in C++ which is mainly developed for
training convolutional neural networks. It supports computations on CPUs as well
as GPUs. It also provides interfaces for the Python and Matlab programming languages.
Classification score A value computed by wx + b in the classification layer. This
score is related to the distance of the sample from the decision boundary.
Decision boundary In a binary classification model, the decision boundary is a hypothetical boundary represented by the classification model such that points on one
side of the boundary are classified as 1 and points on the other side of the boundary
are classified as 0. This can be easily generalized to multiclass problems where
the feature space is divided into several regions using decision boundaries.
Depth of network The depth of the deepest node in the corresponding computational
graph of the network. Note that the depth of a network is not always equal to the
number of layers (computational nodes). The reason is that in networks such as
GoogleNet some of the nodes have the same depth in the computational graph.
Dropout Dropout is a simple but effective technique for regularizing a neural
network. It works by randomly dropping neurons from the network in each
iteration of the training algorithm. This means that the outputs and gradients of the selected
neurons are set to zero so they have no impact on the forward and backward
passes.
Early stopping It is a technique based on training and validation performance
to detect overfitting during training and stop the training algorithm. For this purpose,
the performance of the model is computed on a validation set as well as the training
set. When their difference exceeds a certain threshold, the training algorithm
stops. However, in some cases, even when the neural network starts to overfit on
the training data, the validation accuracy might still be ascending. In that case,
we may not stop the training algorithm.
Feature space A neural network can be thought of as a composite feature transformation function. Given the input x ∈ R^p, it transforms the input vector into a
q-dimensional space. This q-dimensional space is called the feature space.
Generalization The ability of a model to accurately classify unseen samples is called
generalization. A model is acceptable and reliable if it generalizes well to unseen
samples.
Gradient check It is a numerical technique that is used while implementing the
backpropagation algorithm. This technique ensures that the gradient computation is
done correctly by the implemented backpropagation algorithm.
Loss function Training a classification model is not possible unless there is an
objective function that tells how well the model classifies the training samples.
This objective function is called the loss function.
Mini-batch gradient descend Mini-batch gradient descend is an optimization
technique which tries to solve the high-variance issue of stochastic gradient
descend and the high computational cost of batch gradient descend. Instead of using only
one sample (stochastic gradient descend) or all samples of the dataset (batch
gradient descend), it computes the gradient over a few samples (for instance, 60 samples)
from the training set.
Momentum gradient descend Momentum gradient descend is a variant of gradient descend where gradients are accumulated at each iteration. Parameters are
updated based on the accumulated gradients rather than the gradient of the current iteration. Theoretically, it increases the convergence speed on ravine surfaces.
Nesterov accelerated gradient The main issue with momentum gradient descend is
that it accumulates gradients blindly and corrects its course after making a
mistake. Nesterov gradient descend partially addresses this problem by computing
the gradient at the next step rather than the current step. In other words, it tries to
correct its course before making a mistake.
Neuron activation The output computed by applying an activation function such
as ReLU to the output of a neuron.
Object detection The goal of object detection is to locate instances of a particular
object, such as traffic signs, in an image.
Object classification Object classification is usually the next step after object
detection. Its aim is to categorize the image into one of the object classes. For example,
after detecting the locations of traffic signs in an image, a traffic-sign classifier
tries to find the exact category of each sign.
Object recognition It usually refers to detection and classification of objects in
an image.
Overfitting Highly nonlinear models such as neural networks are able to model small
deviations in the feature space. In many cases, this causes the model not to
generalize well to unseen samples. This problem can be more severe if the
number of training samples is not high.
Receptive field Each neuron in a convolutional neural network has a receptive
field on the input image. The receptive field of neuron z_i is the region on
the input image in which changing the value of a pixel in that region will change
the output of z_i. Denoting the input image by x, the receptive field of z_i is the
region of the image where δz_i/δx is not zero. In general, a neuron at a higher depth
usually has a larger receptive field on the image.
Regularization Highly nonlinear models are prone to overfitting on the data and they may
not generalize to unseen samples, especially when the number of training samples
is not high. As the magnitude of the weights of the model increases, it becomes more
and more nonlinear. Regularization is a technique to restrict the magnitude of the weights
and keep them less than a specific value. Two commonly used regularization
techniques are penalizing the loss function using the L1 or L2 norms. Sometimes,
a combination of these two norms is also used for penalizing a loss function.
RMSProp The main problem with the Adagrad method is that the learning rates may
drop within a few iterations, after which the parameter updates might become very
small or even negligible. RMSProp is a technique that alleviates this problem of
Adagrad. It has a mechanism to forget the sum of squared gradients over time.
Stochastic gradient descend The opposite of batch gradient descend is stochastic
gradient descend. In this method, the gradient of the loss function is computed using
only one sample from the training set. The main disadvantage of this method is that
the variance of the gradients can be very high, causing a jittery trajectory of parameter updates.
Time to completion The total time that a model takes to compute the output.
Vanishing gradients This phenomenon usually happens in deep networks with
squashing activation functions such as the hyperbolic tangent or the sigmoid. Because
the gradient of a squashing function becomes approximately zero as the magnitude of x
increases, the gradient becomes smaller and smaller as the error is backpropagated to the first layers. In most cases, the gradient becomes very close to zero (vanishes),
in which case the network does not learn anymore.
Width of network The width of a network is equal to the number of feature maps produced
at the same depth. Calculating the width of architectures such as AlexNet
is simple, but computing the width of architectures such as GoogleNet is
slightly harder since there are several layers at the same depth in its corresponding
computational graph.

Reference
Bengio Y (2012) Practical recommendations for gradient-based training of deep architectures. Lecture notes in computer science, pp 437–478. doi:10.1007/978-3-642-35289-8-26,
arXiv:1206.5533

Index

A
Accuracy, 124
Activation function, 62, 71, 142
AdaDelta, 154
Adagrad, 154, 271
Adam, 154
Advanced Driver Assistant System (ADAS), 1, 167
AlexNet, 100
Artificial neural network, 61
Artificial neuron, 62
Augmenting dataset, 177
Average pooling, 97, 127, 144
Axon, 62
Axon terminals, 62
B
Backpropagation, 65, 68, 92
Backward pass, 68
Bagging, 201
Batch gradient descend, 263
Batch normalization, 127
Bernoulli distribution, 35
Bias, 20, 62
Binary classification, 16, 20, 41
Boosting, 201
Boutons, 62
C
Caffe, 105, 131
Classification, 16
Classification accuracy, 54, 106, 108, 159
Classification metric function, 106
Classification score, 22, 44
Class overlap, 16
Cluster based sampling, 175

Computational graph, 50, 59, 65
Confusion matrix, 108
Contrast normalization, 185
Contrast stretching, 185
Convolutional neural networks, 9, 63, 85
Convolution layer, 89
Covariance matrix, 8
Cross entropy, 65
Cross-entropy loss, 36, 147
Cross-validation, 176
hold-out, 176
K-fold, 177
CuDNN, 104, 131
Curse of dimensionality, 20
D
Dead neuron, 74
Decision boundary, 18, 21, 22, 30, 31, 116
Decision stump, 201
Deep neural network, 73
Dendrites, 62
Development set, 105
Directed acyclic graph, 102
Discriminant function, 20
Dot product, 21
Downsampling, 95
Dropout, 119, 146
Dropout ratio, 120
DUPLEX sampling, 175
E
Early stopping, 189
Eigenvalue, 8
Eigenvector, 8
Elastic net, 9, 119
Ensemble learning, 200
Epoch, 139
Error function, 22
Euclidean distance, 5, 8, 175
Exploding gradient problem, 119
Exponential annealing, 123, 153
Exponential Linear Unit (ELU), 76, 142
F
F1-score, 111, 124, 161
False-negative, 108
False-positive, 2, 108
Feature extraction, 5, 57
Feature learning, 7
Feature maps, 90
Feature vector, 5, 57
Feedforward neural network, 63, 85
Fully connected, 81, 146
G
Gabor filter, 90
Gaussian smoothing, 178
Generalization, 116
Genetic algorithm, 203
Global pooling, 144
Gradient check, 131
Gradient descend, 24
Gradient vanishing problem, 119
GTSRB, 173
H
Hand-crafted feature, 58
Hand-crafted methods, 6
Hand-engineered feature, 58
Hidden layer, 63
Hierarchical clustering, 175
Hinge loss, 31, 32, 38, 39, 147
Histogram equalization, 185
Histogram of Oriented Gradients (HOG), 6, 7, 57, 58, 80, 247
Hold-out cross-validation, 176
HSI color space, 6
HSV color space, 180
Hyperbolic tangent, 72, 190
Hyperparameter, 7, 64
I
ImageNet, 100
Imbalanced dataset, 45, 185
downsampling, 186
hybrid sampling, 186
synthesizing data, 187
upsampling, 186
weighted loss function, 186
Imbalanced set, 107
Indegree, 68
Initialization
MRSA, 142
Xavier, 142
Intercept, 20, 62
Inverse annealing, 123, 153
Isomaps, 198
K
Keras, 104
K-fold cross-validation, 177
K-means, 175
K nearest neighbor, 17
L
L1 regularization, 118
L2 regularization, 118
Lasagne, 104
Leaky Rectified Linear Unit (Leaky ReLU), 75, 142
Learning rate, 121
LeNet-5, 98, 150
Likelihood, 50
Linear model, 17
Linear separability, 16
Lipschitz constant, 227
Local binary pattern, 247
Local linear embedding, 198
Local response normalization, 101, 126
Log-likelihood, 50
Log-linear model, 48
Logistic loss, 38, 59
Logistic regression, 34, 52, 63
Loss function, 22
0/1 loss, 23, 46
cross-entropy loss, 36
hinge loss, 31
squared loss, 24
M
Majority voting, 42, 200
Margin, 30
Matching pursuit, 9
Max-norm regularization, 119, 154
Max-pooling, 95, 100, 127, 144
Mean square error, 25
Mean-variance normalization, 112, 114
Median filter, 179
Mini-batch gradient descend, 266
Mirroring, 182
Mixed pooling, 127
Model averaging, 200
Model bias, 117, 125
Model variance, 117, 125, 175
Modified Huber loss, 34
Momentum gradient descend, 267, 268
Motion blur, 178
MRSA initialization, 142
Multiclass classification, 41
Multiclass hinge, 47
Multiclass logistic loss, 190

N
Nesterov, 154
Nesterov accelerated gradients, 270
Neuron, 61
Nonparametric models, 17
Nucleus, 62
Numpy, 158
O
Objective function, 259
One versus one, 41
One versus rest, 44
Otsu thresholding, 6
Outdegree, 68
Outliers, 37
Output layer, 63
Overfit, 39, 116
P
Parameterized Rectified Linear Unit (PReLU), 142, 163
Parametric models, 17
Parametrized Rectified Linear Unit (PReLU), 75
Pooling, 90, 95
average pooling, 144
global pooling, 144
max pooling, 144
stochastic pooling, 144
Portable pixel map, 173
Posterior probability, 49
Precision, 110
Principal component analysis, 8
Protocol Buffers, 133
R
Random cropping, 180
Random forest, 52
Randomized Rectified Linear Unit (RReLU), 76
Random sampling, 175
Recall, 110
Receptive field, 88
Reconstruction error, 8
Rectified Linear Unit (ReLU), 74, 100, 117, 142
Recurrent neural networks, 63
Regression, 16
Regularization, 117, 153
Reinforcement learning, 15
RMSProp, 154, 272
S
Sampling
cluster based sampling, 175
DUPLEX sampling, 175
random sampling, 175
Saturated gradient, 29
Self organizing maps, 198
Shallow neural network, 73
Sharpening, 179
Shifted ReLU, 165
Sigmoid, 63, 71, 99, 117, 142
Sliding window, 238
Softmax, 49
Softplus, 60, 77, 162
Softsign, 73
Soma, 62
Sparse, 9
Sparse coding, 9
Spatial pyramid pooling, 127
Squared hinge loss, 33
Squared loss, 38
Squared loss function, 24
Step annealing, 123
Step-based annealing, 153
Stochastic gradient descend, 264
Stochastic pooling, 97, 144
Stride, 94
Supervised learning, 15
T
TanH, 142
T-distributed stochastic neighbor embedding, 198, 249
Template matching, 5
cross correlation, 5
normalized cross correlation, 5
normalized sum of square differences, 5
sum of squared differences, 5
Tensor, 139
TensorFlow, 104
Test set, 105, 158, 174
Theano, 103
Time-to-completion, 207
Torch, 104
Training data, 8
Training set, 105, 158, 174
True-negative, 108
True-positive, 108
True-positive rate, 3

U
Universal approximator, 64
Unsupervised, 8
Unsupervised learning, 15
V
Validation set, 101, 106, 149, 158, 174
Vanishing gradient problem, 74
Vanishing gradients, 71, 119
Vienna convention on road traffic signs, 3
Visualization, 248
W
Weighted voting, 200
Weight sharing, 88
X
Xavier initialization, 113, 142
Z
Zero padding, 140, 144
Zero-centered, 8
Zero-one loss, 38


