A Guide to Convolutional Neural Networks for Computer Vision
Salman Khan, Data61-CSIRO and Australian National University
Hossein Rahmani, The University of Western Australia
Syed Afaq Ali Shah, The University of Western Australia
Mohammed Bennamoun, The University of Western Australia

Synthesis Lectures on
Computer Vision
Editors
Gérard Medioni, University of Southern California
Sven Dickinson, University of Toronto
Synthesis Lectures on Computer Vision is edited by Gérard Medioni of the University of Southern
California and Sven Dickinson of the University of Toronto. The series publishes 50–150 page
publications on topics pertaining to computer vision and pattern recognition. The scope will largely
follow the purview of premier computer science conferences, such as ICCV, CVPR, and ECCV.
Potential topics include, but are not limited to:
• Applications and Case Studies for Computer Vision
• Color, Illumination, and Texture
• Computational Photography and Video
• Early and Biologically-inspired Vision
• Face and Gesture Analysis
• Illumination and Reflectance Modeling
• Image-Based Modeling
• Image and Video Retrieval
• Medical Image Analysis
• Motion and Tracking
• Object Detection, Recognition, and Categorization
• Segmentation and Grouping
• Sensors
• Shape-from-X
• Stereo and Structure from Motion
• Shape Representation and Matching

• Statistical Methods and Learning
• Performance Evaluation
• Video Analysis and Event Recognition

A Guide to Convolutional Neural Networks for Computer Vision
Salman Khan, Hossein Rahmani, Syed Afaq Ali Shah, and Mohammed Bennamoun
2018

Covariances in Computer Vision and Machine Learning
Hà Quang Minh and Vittorio Murino
2017

Elastic Shape Analysis of Three-Dimensional Objects
Ian H. Jermyn, Sebastian Kurtek, Hamid Laga, and Anuj Srivastava
2017

The Maximum Consensus Problem: Recent Algorithmic Advances
Tat-Jun Chin and David Suter
2017

Extreme Value Theory-Based Methods for Visual Recognition
Walter J. Scheirer
2017

Data Association for Multi-Object Visual Tracking
Margrit Betke and Zheng Wu
2016

Ellipse Fitting for Computer Vision: Implementation and Applications
Kenichi Kanatani, Yasuyuki Sugaya, and Yasushi Kanazawa
2016

Computational Methods for Integrating Vision and Language
Kobus Barnard
2016

Background Subtraction: Theory and Practice
Ahmed Elgammal
2014

Vision-Based Interaction
Matthew Turk and Gang Hua
2013

Camera Networks: The Acquisition and Analysis of Videos over Wide Areas
Amit K. Roy-Chowdhury and Bi Song
2012

Deformable Surface 3D Reconstruction from Monocular Images
Mathieu Salzmann and Pascal Fua
2010

Boosting-Based Face Detection and Adaptation
Cha Zhang and Zhengyou Zhang
2010

Image-Based Modeling of Plants and Trees
Sing Bing Kang and Long Quan
2009

Copyright © 2018 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
A Guide to Convolutional Neural Networks for Computer Vision
Salman Khan, Hossein Rahmani, Syed Afaq Ali Shah, and Mohammed Bennamoun
www.morganclaypool.com

ISBN: 9781681730219 paperback
ISBN: 9781681730226 ebook
ISBN: 9781681732787 hardcover

DOI 10.2200/S00822ED1V01Y201712COV015

A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON COMPUTER VISION
Lecture #15
Series Editors: Gérard Medioni, University of Southern California
Sven Dickinson, University of Toronto
Series ISSN
Print 2153-1056 Electronic 2153-1064


ABSTRACT
Computer vision has become increasingly important and effective in recent years due to its
wide-ranging applications in areas as diverse as smart surveillance and monitoring, health and
medicine, sports and recreation, robotics, drones, and self-driving cars. Visual recognition tasks,
such as image classification, localization, and detection, are the core building blocks of many of
these applications, and recent developments in Convolutional Neural Networks (CNNs) have
led to outstanding performance in these state-of-the-art visual recognition tasks and systems.
As a result, CNNs now form the crux of deep learning algorithms in computer vision.
This self-contained guide will benefit those who seek to both understand the theory behind CNNs and to gain hands-on experience on the application of CNNs in computer vision.
It provides a comprehensive introduction to CNNs starting with the essential concepts behind
neural networks: training, regularization, and optimization of CNNs. The book also discusses a
wide range of loss functions, network layers, and popular CNN architectures, reviews the different techniques for the evaluation of CNNs, and presents some popular CNN tools and libraries
that are commonly used in computer vision. Further, this text describes and discusses case studies that are related to the application of CNNs in computer vision, including image classification,
object detection, semantic segmentation, scene understanding, and image generation.
This book is ideal for undergraduate and graduate students, as no prior background knowledge in the field is required to follow the material, as well as new researchers, developers, engineers, and practitioners who are interested in gaining a quick understanding of CNN models.

KEYWORDS
deep learning, computer vision, convolutional neural networks, perception, backpropagation, feed-forward networks, image classification, action recognition, object
detection, object tracking, video processing, semantic segmentation, scene understanding, 3D processing

SK: To my parents and my wife Nusrat
HR: To my father Shirzad, my mother Rahimeh, and my wife
Shahla
AS: To my parents, my wife Maleeha, and our children Abiya, Maryam, and Muhammad. Thanks for always being there for me.

MB: To my parents: Mostefa and Rabia Bennamoun and to my
nuclear family: Leila, Miriam, Basheer, and Rayaane Bennamoun

Contents
Preface
Acknowledgments

1 Introduction
  1.1 What is Computer Vision?
    1.1.1 Applications
    1.1.2 Image Processing vs. Computer Vision
  1.2 What is Machine Learning?
    1.2.1 Why Deep Learning?
  1.3 Book Overview

2 Features and Classifiers
  2.1 Importance of Features and Classifiers
    2.1.1 Features
    2.1.2 Classifiers
  2.2 Traditional Feature Descriptors
    2.2.1 Histogram of Oriented Gradients (HOG)
    2.2.2 Scale-invariant Feature Transform (SIFT)
    2.2.3 Speeded-up Robust Features (SURF)
    2.2.4 Limitations of Traditional Hand-engineered Features
  2.3 Machine Learning Classifiers
    2.3.1 Support Vector Machine (SVM)
    2.3.2 Random Decision Forest
  2.4 Conclusion

3 Neural Networks Basics
  3.1 Introduction
  3.2 Multi-layer Perceptron
    3.2.1 Architecture Basics
    3.2.2 Parameter Learning
  3.3 Recurrent Neural Networks
    3.3.1 Architecture Basics
    3.3.2 Parameter Learning
  3.4 Link with Biological Vision
    3.4.1 Biological Neuron
    3.4.2 Computational Model of a Neuron
    3.4.3 Artificial vs. Biological Neuron

4 Convolutional Neural Network
  4.1 Introduction
  4.2 Network Layers
    4.2.1 Pre-processing
    4.2.2 Convolutional Layers
    4.2.3 Pooling Layers
    4.2.4 Nonlinearity
    4.2.5 Fully Connected Layers
    4.2.6 Transposed Convolution Layer
    4.2.7 Region of Interest Pooling
    4.2.8 Spatial Pyramid Pooling Layer
    4.2.9 Vector of Locally Aggregated Descriptors Layer
    4.2.10 Spatial Transformer Layer
  4.3 CNN Loss Functions
    4.3.1 Cross-entropy Loss
    4.3.2 SVM Hinge Loss
    4.3.3 Squared Hinge Loss
    4.3.4 Euclidean Loss
    4.3.5 The ℓ1 Error
    4.3.6 Contrastive Loss
    4.3.7 Expectation Loss
    4.3.8 Structural Similarity Measure

5 CNN Learning
  5.1 Weight Initialization
    5.1.1 Gaussian Random Initialization
    5.1.2 Uniform Random Initialization
    5.1.3 Orthogonal Random Initialization
    5.1.4 Unsupervised Pre-training
    5.1.5 Xavier Initialization
    5.1.6 ReLU Aware Scaled Initialization
    5.1.7 Layer-sequential Unit Variance
    5.1.8 Supervised Pre-training
  5.2 Regularization of CNN
    5.2.1 Data Augmentation
    5.2.2 Dropout
    5.2.3 Drop-connect
    5.2.4 Batch Normalization
    5.2.5 Ensemble Model Averaging
    5.2.6 The ℓ2 Regularization
    5.2.7 The ℓ1 Regularization
    5.2.8 Elastic Net Regularization
    5.2.9 Max-norm Constraints
    5.2.10 Early Stopping
  5.3 Gradient-based CNN Learning
    5.3.1 Batch Gradient Descent
    5.3.2 Stochastic Gradient Descent
    5.3.3 Mini-batch Gradient Descent
  5.4 Neural Network Optimizers
    5.4.1 Momentum
    5.4.2 Nesterov Momentum
    5.4.3 Adaptive Gradient
    5.4.4 Adaptive Delta
    5.4.5 RMSprop
    5.4.6 Adaptive Moment Estimation
  5.5 Gradient Computation in CNNs
    5.5.1 Analytical Differentiation
    5.5.2 Numerical Differentiation
    5.5.3 Symbolic Differentiation
    5.5.4 Automatic Differentiation
  5.6 Understanding CNN through Visualization
    5.6.1 Visualizing Learned Weights
    5.6.2 Visualizing Activations
    5.6.3 Visualizations based on Gradients

6 Examples of CNN Architectures
  6.1 LeNet
  6.2 AlexNet
  6.3 Network in Network
  6.4 VGGnet
  6.5 GoogleNet
  6.6 ResNet
  6.7 ResNeXt
  6.8 FractalNet
  6.9 DenseNet

7 Applications of CNNs in Computer Vision
  7.1 Image Classification
    7.1.1 PointNet
  7.2 Object Detection and Localization
    7.2.1 Region-based CNN
    7.2.2 Fast R-CNN
    7.2.3 Regional Proposal Network (RPN)
  7.3 Semantic Segmentation
    7.3.1 Fully Convolutional Network (FCN)
    7.3.2 Deep Deconvolution Network (DDN)
    7.3.3 DeepLab
  7.4 Scene Understanding
    7.4.1 DeepContext
    7.4.2 Learning Rich Features from RGB-D Images
    7.4.3 PointNet for Scene Understanding
  7.5 Image Generation
    7.5.1 Generative Adversarial Networks (GANs)
    7.5.2 Deep Convolutional Generative Adversarial Networks (DCGANs)
    7.5.3 Super Resolution Generative Adversarial Network (SRGAN)
  7.6 Video-based Action Recognition
    7.6.1 Action Recognition From Still Video Frames
    7.6.2 Two-stream CNNs
    7.6.3 Long-term Recurrent Convolutional Network (LRCN)

8 Deep Learning Tools and Libraries
  8.1 Caffe
  8.2 TensorFlow
  8.3 MatConvNet
  8.4 Torch7
  8.5 Theano
  8.6 Keras
  8.7 Lasagne
  8.8 Marvin
  8.9 Chainer
  8.10 PyTorch

9 Conclusion

Bibliography
Authors’ Biographies

Preface
The primary goal of this book is to provide a comprehensive treatment of the subject of convolutional neural networks (CNNs) from the perspective of computer vision. In this regard, this book covers basic, intermediate, as well as advanced topics relating to both the theoretical and
practical aspects.
This book is organized into nine chapters. The first chapter introduces the computer vision and machine learning disciplines and presents their highly relevant application domains.
This sets up the platform for the main subject of this book, “Deep Learning”, which is first defined toward the later part of the first chapter. The second chapter serves as background material, which presents the hand-crafted features and classifiers that have remained popular in
computer vision during the last two decades. These include feature descriptors such as Scale-Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG), and Speeded-Up
Robust Features (SURF), and classifiers such as Support Vector Machines (SVM) and Random
Decision Forests (RDF).
Chapter 3 describes neural networks and covers preliminary concepts related to their architecture, basic building blocks, and learning algorithms. Chapter 4 builds on this and serves
as a thorough introduction to CNN architecture. It covers its layers, including the basic ones
(e.g., sub-sampling, convolution) as well as more advanced ones (e.g., pyramid pooling, spatial transform). Chapter 5 comprehensively presents techniques to learn and regularize CNN
parameters. It also provides tools to visualize and understand the learned parameters.
Chapter 6 and onward are more focused on the practical aspects of CNNs. Specifically,
Chapter 6 presents state-of-the-art CNN architectures that have demonstrated excellent performances on a number of vision tasks. It also provides a comparative analysis and discusses
their relative pros and cons. Chapter 7 goes in further depth regarding applications of CNNs to
core vision problems. For each task, it discusses a set of representative works using CNNs and
reports their key ingredients for success. Chapter 8 covers popular software libraries for deep
learning such as Theano, Tensorflow, Caffe, and Torch. Finally, in Chapter 9, open problems
and challenges for deep learning are presented along with a succinct summary of the book.
The purpose of the book is not to provide a literature survey for the applications of CNNs
in computer vision. Rather, it succinctly covers key concepts and provides a bird’s eye view of
recent state-of-the-art models designed for practical problems in computer vision.
Salman Khan, Hossein Rahmani, Syed Afaq Ali Shah, and Mohammed Bennamoun
January 2018

Acknowledgments
We would like to thank Gerard Medioni and Sven Dickinson, the editors of this Synthesis Lectures on Computer Vision series, for giving us an opportunity to contribute to this series. We
greatly appreciate the help and support of Diane Cerra, Executive Editor at Morgan & Claypool, who managed the complete book preparation process. We are indebted to our colleagues,
students, collaborators, and co-authors we worked with during our careers, who contributed to
the development of our interest in this subject. We are also deeply thankful to the wider research
community, whose work has led to major advancements in computer vision and machine learning, a part of which is covered in this book. More importantly, we want to express our gratitude
toward the people who allowed us to use their figures or tables in some portions of this book. This
book has greatly benefited from the constructive comments and appreciation by the reviewers,
which helped us improve the presented content. Finally, this effort would not have been possible
without the help and support from our families.
We would like to acknowledge support from the Australian Research Council (ARC), whose funding and support were crucial to some of the contents of this book.
Salman Khan, Hossein Rahmani, Syed Afaq Ali Shah, and Mohammed Bennamoun
January 2018

CHAPTER 1

Introduction
Computer vision and machine learning have together played decisive roles in the development of a variety of image-based applications over the last decade (e.g., various services provided by Google, Facebook, Microsoft, and Snapchat). During this time, vision-based technology has transformed from a mere sensing modality into intelligent computing systems that can understand the real world. Acquiring computer vision and machine learning (e.g., deep learning) knowledge is therefore an important skill required in many modern innovative businesses, and it is likely to become even more important in the near future.

1.1 WHAT IS COMPUTER VISION?

Humans use their eyes and their brains to see and understand the 3D world around them. For
example, given an image as shown in Fig. 1.1a, humans can easily see a “cat” in the image and
thus, categorize the image (classification task); localize the cat in the image (classification plus
localization task as shown in Fig. 1.1b); localize and label all objects that are present in the image
(object detection task as shown in Fig. 1.1c); and segment the individual objects that are present
in the image (instance segmentation task as shown in Fig. 1.1d). Computer vision is the science
that aims to give a similar, if not better, capability to computers. More precisely, computer vision
seeks to develop methods which are able to replicate one of the most amazing capabilities of the
human visual system, i.e., inferring characteristics of the 3D real world purely using the light
reflected to the eyes from various objects.
However, recovering and understanding the 3D structure of the world from two-dimensional images captured by cameras is a challenging task. Researchers in computer vision have been developing mathematical techniques to recover the three-dimensional shape and appearance of objects/scenes from images. For example, given a large enough set of images of an
object captured from a variety of views (Fig. 1.2), computer vision algorithms can reconstruct
an accurate dense 3D surface model of the object using dense correspondences across multiple
views. However, despite all of these advances, understanding images at the same level as humans
still remains challenging.

1.1.1 APPLICATIONS
Due to the significant progress in the field of computer vision and visual sensor technology,
computer vision techniques are being used today in a wide variety of real-world applications,
such as intelligent human-computer interaction, robotics, and multimedia. It is also expected

Figure 1.1: What do we want computers to do with the image data? To look at the image and
perform classification, classification plus localization (i.e., to find a bounding box around the
main object (CAT) in the image and label it), to localize all objects that are present in the image
(CAT, DOG, DUCK) and to label them, or perform semantic instance segmentation, i.e., the
segmentation of the individual objects within a scene, even if they are of the same type.
Figure 1.2: Given a set of images of an object (e.g., upper human body) captured from six different viewpoints, a dense 3D model of the object can be reconstructed using computer vision
algorithms.

that the next generation of computers could even understand human actions and languages at
the same level as humans, carry out some missions on behalf of humans, and respond to human
commands in a smart way.


Human-computer Interaction
Nowadays, video cameras are widely used for human-computer interaction and in the entertainment industry. For instance, hand gestures are used in sign language to communicate, transfer
messages in noisy environments, and interact with computer games. Video cameras provide a
natural and intuitive way of human communication with a device. Therefore, one of the most
important aspects for these cameras is the recognition of gestures and short actions from videos.
Robotics
Integrating computer vision technologies with high-performance sensors and cleverly designed
hardware has given rise to a new generation of robots which can work alongside humans and
perform many different tasks in unpredictable environments. For example, an advanced humanoid robot can jump, talk, run, or walk up stairs in a very similar way a human does. It can
also recognize and interact with people. In general, an advanced humanoid robot can perform
various activities that are mere reflexes for humans and do not require a high intellectual effort.
Multimedia
Computer vision technology plays a key role in multimedia applications. These applications have led to a massive research effort in the development of computer vision algorithms for processing, analyzing, and interpreting multimedia data. For example, given a video, one can ask “What does this video mean?”, which involves the quite challenging tasks of image/video understanding and summarization. As another example, given a video clip, computers could search the Internet and retrieve millions of similar videos. More interestingly, when a viewer gets tired of watching a long movie, a computer could automatically summarize the movie for them.

1.1.2 IMAGE PROCESSING VS. COMPUTER VISION
Image processing can be considered as a preprocessing step for computer vision. More precisely,
the goal of image processing is to extract fundamental image primitives, such as edges and corners, using operations such as filtering and morphology. These image primitives are usually represented as
images. For example, in order to perform semantic image segmentation (Fig. 1.1), which is a
computer vision task, one might need to apply some filtering on the image (an image processing
task) during that process.
Unlike image processing, which is mainly focused on processing raw images without giving any knowledge feedback on them, computer vision produces semantic descriptions of images.
Based on the abstraction level of the output information, computer vision tasks can be divided
into three different categories, namely low-level, mid-level, and high-level vision.
Low-level Vision
Based on the extracted image primitives, low-level vision tasks can be performed on images/videos. Image matching is an example of a low-level vision task. It is defined as the automatic identification of corresponding image points on a given pair of images of the same scene taken from different viewpoints, or of a moving scene captured by a fixed camera. Identifying image correspondences
is an important problem in computer vision for geometry and motion recovery.
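As a concrete illustration of image matching (a sketch of ours, not code from the book), the snippet below finds corresponding SIFT keypoints between two views with OpenCV; the image file names are placeholders, and cv2.SIFT_create() assumes a reasonably recent OpenCV build that ships SIFT.

```python
# Minimal sketch: matching SIFT keypoints between two views of the same scene.
# File names are placeholders for two overlapping images.
import cv2

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)   # keypoints and 128-d descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matching with Lowe's ratio test to keep reliable correspondences.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), "putative correspondences")
```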
Another fundamental low-level vision task is optical flow computation and motion analysis. Optical flow is the pattern of the apparent motion of objects, surfaces, and edges in a visual
scene caused by the movement of an object or camera. Optical flow is a 2D vector field where
each vector corresponds to a displacement vector showing the movement of points from one
frame to the next. Most existing methods which estimate camera motion or object motion use
optical flow information.
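For illustration only (not from the book), the following OpenCV sketch estimates such a dense optical flow field between two consecutive frames with the Farneback method; the frame file names and parameter values are placeholder choices.

```python
# Minimal sketch: dense optical flow between two consecutive gray-scale frames.
import cv2

prev_frame = cv2.imread("frame_0.png", cv2.IMREAD_GRAYSCALE)
curr_frame = cv2.imread("frame_1.png", cv2.IMREAD_GRAYSCALE)

# Positional arguments: pyr_scale, levels, winsize, iterations, poly_n,
# poly_sigma, flags. The result has shape (H, W, 2): a per-pixel (dx, dy)
# displacement vector, i.e., the 2D vector field described in the text.
flow = cv2.calcOpticalFlowFarneback(prev_frame, curr_frame, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)
```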
Mid-level Vision
Mid-level vision provides a higher level of abstraction than low-level vision. For instance, inferring the geometry of objects is one of the major aspects of mid-level vision. Geometric vision
includes multi-view geometry, stereo, and structure from motion (SfM), which infer the 3D
scene information from 2D images such that 3D reconstruction could be made possible. Another task of mid-level vision is visual motion capturing and tracking, which estimate 2D and
3D motions, including deformable and articulated motions. In order to answer the question
“How does the object move?,” image segmentation is required to find areas in the images which
belong to the object.
High-level Vision
Based on an adequate segmented representation of the 2D and/or 3D structure of the image,
extracted using lower level vision (e.g., low-level image processing, low-level and mid-level vision), high-level vision completes the task of delivering a coherent interpretation of the image.
High-level vision determines what objects are present in the scene and interprets their interrelations. For example, object recognition and scene understanding are two high-level vision tasks
which infer the semantics of objects and scenes, respectively. How to achieve robust recognition, e.g., recognizing an object from different viewpoints, is still a challenging problem.
Another example of higher level vision is image understanding and video understanding.
Based on information provided by object recognition, image and video understanding try to
answer questions such as “Is there a tiger in the image?” or “Is this video a drama or an action?,” or
“Is there any suspicious activity in a surveillance video?” Developing such high-level vision tasks
helps to fulfill different higher level tasks in intelligent human-computer interaction, intelligent
robots, smart environment, and content-based multimedia.

1.2 WHAT IS MACHINE LEARNING?

Computer vision algorithms have seen rapid progress in recent years. In particular, combining computer vision with machine learning contributes to the development of flexible and robust computer vision algorithms and, thus, improves the performance of practical vision systems.

For instance, Facebook has combined computer vision, machine learning, and their large corpus
of photos, to achieve a robust and highly accurate facial recognition system. That is how Facebook can suggest who to tag in your photo. In the following, we first define machine learning
and then describe the importance of machine learning for computer vision tasks.
Machine learning is a type of artificial intelligence (AI) which allows computers to learn
from data without being explicitly programmed. In other words, the goal of machine learning is to design methods that automatically perform learning using observations of the real
world (called the “training data”), without explicit definition of rules or logic by the humans
(“trainer”/“supervisor”). In that sense, machine learning can be considered as programming by
data samples. In summary, machine learning is about learning to do better in the future based
on what was experienced in the past.
A diverse set of machine learning algorithms has been proposed to cover the wide variety of data and problem types. These learning methods can be divided into three main approaches, namely supervised, semi-supervised, and unsupervised. However, the majority of practical machine learning methods are currently supervised learning methods, because of their superior performance compared to their counterparts. In supervised learning methods, the training data takes the form of a collection of (data x, label y) pairs, and the goal is to produce a prediction y* in response to a query sample x. The input x can be a feature vector, or more complex data such as images, documents, or graphs. Similarly, different types of output y have been studied. The output y can be a binary label, as used in a simple binary classification problem (e.g., “yes” or “no”). However, there have also been numerous research works on problems such as multi-class classification, where y is assigned one of k labels; multi-label classification, where y simultaneously takes on several of the k labels; and general structured prediction problems, where y is
a high-dimensional output, which is constructed from a sequence of predictions (e.g., semantic
segmentation).
Supervised learning methods approximate a mapping function f(x) which can predict the output variables y for a given input sample x. Different forms of the mapping function f(·) exist
(some are briefly covered in Chapter 2), including decision trees, Random Decision Forests
(RDF), logistic regression (LR), Support Vector Machines (SVM), Neural Networks (NN),
kernel machines, and Bayesian classifiers. A wide range of learning algorithms has also been
proposed to estimate these different types of mappings.
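As a minimal, hedged illustration of learning such a mapping f(x) (not from the book), the scikit-learn sketch below fits an SVM classifier to labeled (data, label) pairs from a small built-in digits dataset; the choice of model, dataset, and hyperparameters is purely illustrative.

```python
# Minimal sketch: supervised learning of a mapping f(x) from labeled pairs.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)          # x: 8x8 digit images as 64-d feature vectors
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

f = SVC(kernel="rbf", gamma="scale")          # one possible form of f(x)
f.fit(X_tr, y_tr)                             # learn from (data, label) pairs
print("held-out accuracy:", f.score(X_te, y_te))
```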
On the other hand, unsupervised learning is where one would only have input data X
and no corresponding output variables. It is called unsupervised learning because (unlike supervised learning) there are no ground-truth outputs and there is no teacher. The goal of unsupervised learning is to model the underlying structure/distribution of data in order to discover an
interesting structure in the data. The most common unsupervised learning methods are clustering approaches such as hierarchical clustering, k-means clustering, Gaussian Mixture Models (GMMs), Self-Organizing Maps (SOMs), and Hidden Markov Models (HMMs).
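The following sketch (again illustrative, not from the book) runs one of these clustering approaches, k-means, on unlabeled feature vectors with scikit-learn; the random data and the number of clusters are arbitrary choices.

```python
# Minimal sketch: unsupervised clustering of unlabeled feature vectors.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 64)                   # 500 unlabeled 64-d feature vectors
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
print(clusters[:20])                          # cluster index assigned to each sample
```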

Semi-supervised learning methods sit in-between supervised and unsupervised learning.
These learning methods are used when a large amount of input data is available and only some
of the data is labeled. A good example is a photo archive where only some of the images are
labeled (e.g., dog, cat, person), and the majority are unlabeled.
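A possible sketch of this setting (ours, not the book's) uses scikit-learn's LabelSpreading, where unlabeled samples are marked with -1 and labels are propagated from the few labeled ones; the dataset and parameters are illustrative.

```python
# Minimal sketch: semi-supervised learning with mostly unlabeled data.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import LabelSpreading

X, y = load_digits(return_X_y=True)
y_partial = np.copy(y)
y_partial[50:] = -1                      # keep labels only for the first 50 samples

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)
print("labels inferred for unlabeled samples:", model.transduction_[50:60])
```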

1.2.1 WHY DEEP LEARNING?
While these machine learning algorithms have been around for a long time, the ability to automatically apply complex mathematical computations to large-scale data is a recent development.
This is because the increased power of today’s computers, in terms of speed and memory, has
helped machine learning techniques evolve to learn from a large corpus of training data. For example, with more computing power and a large enough memory, one can create neural networks
of many layers, which are called deep neural networks. There are three key advantages which are
offered by deep learning.
• Simplicity: Instead of problem specific tweaks and tailored feature detectors, deep networks offer basic architectural blocks, network layers, which are repeated several times to
generate large networks.
• Scalability: Deep learning models are easily scalable to huge datasets. Other competing
methods, e.g., kernel machines, encounter serious computational problems if the datasets
are huge.
• Domain transfer: A model learned on one task is applicable to other related tasks and the
learned features are general enough to work on a variety of tasks which may have scarce
data available.
Due to the tremendous success in learning these deep neural networks, deep learning techniques are currently state-of-the-art for the detection, segmentation, classification and recognition (i.e., identification and verification) of objects in images. Researchers are now working to
apply these successes in pattern recognition to more complex tasks such as medical diagnoses
and automatic language translation. Convolutional Neural Networks (ConvNets or CNNs) are
a category of deep neural networks which have proven to be very effective in areas such as image
recognition and classification (see Chapter 7 for more details). Due to the impressive results of
CNNs in these areas, this book is mainly focused on CNNs for computer vision tasks. Figure 1.3
illustrates the relation between computer vision, machine learning, human vision, deep learning,
and CNNs.
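To make the idea of stacking many layers concrete, here is a minimal sketch (not the book's reference implementation) of a tiny CNN in PyTorch, assembled from repeated basic blocks (convolution, nonlinearity, pooling) and a fully connected classifier; all layer sizes and the 10-way output are arbitrary choices.

```python
# Minimal sketch: a small CNN built from repeated basic building blocks.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                # 10-way classification head
)

x = torch.randn(1, 3, 32, 32)                 # a dummy 32x32 RGB image
print(model(x).shape)                         # torch.Size([1, 10])
```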

Figure 1.3: The relation between human vision, computer vision, machine learning, deep learning, and CNNs.

1.3 BOOK OVERVIEW

CHAPTER 2
The book begins in Chapter 2 with a review of the traditional feature representation and classification methods. Computer vision tasks, such as image classification and object detection, have
traditionally been approached using hand-engineered features which are divided into two different main categories: global features and local features. Due to the popularity of the low-level representation, this chapter first reviews three widely used low-level hand-engineered descriptors,
namely Histogram of Oriented Gradients (HOG) [Triggs and Dalal, 2005], Scale-Invariant
Feature Transform (SIFT) [Lowe, 2004], and Speeded-Up Robust Features (SURF) [Bay et al., 2008]. A typical computer vision system feeds these hand-engineered features to machine learning algorithms to classify images/videos. Two widely used machine learning algorithms, namely SVM [Cortes, 1995] and RDF [Breiman, 2001, Quinlan, 1986], are also introduced in detail.

CHAPTER 3
The performance of a computer vision system is highly dependent on the features used. Therefore, current progress in computer vision has been based on the design of feature learners which
minimize the gap between high-level representations (interpreted by humans) and low-level
features (detected by HOG [Triggs and Dalal, 2005] and SIFT [Lowe, 2004] algorithms).
Deep neural networks are one of the well-known and popular feature learners which allow the
removal of complicated and problematic hand-engineered features. Unlike the standard feature
extraction algorithms (e.g., SIFT and HOG), deep neural networks use several hidden layers
to hierarchically learn the high level representation of an image. For instance, the first layer
might detect edges and curves in the image, the second layer might detect object body-parts
(e.g., hands or paws or ears), the third layer might detect the whole object, etc. In this chapter,
we provide an introduction to deep neural networks, their computational mechanism and their
historical background. Two generic categories of deep neural networks, namely feed-forward
and feed-back networks, with their corresponding learning algorithms are explained in detail.

CHAPTER 4
CNNs are a prime example of deep learning methods and have been most extensively studied.
Due to the lack of training data and computing power in the early days, it was hard to train a
large high-capacity CNN without overfitting. After the rapid growth in the amount of annotated
data and the recent improvements in the strengths of Graphics Processor Units (GPUs), research
on CNNs has emerged rapidly and achieved state-of-the-art results on various computer vision
tasks. In this chapter, we provide a broad survey of the recent advances in CNNs, including state-of-the-art layers (e.g., convolution, pooling, nonlinearity, fully connected, transposed convolution, ROI pooling, spatial pyramid pooling, VLAD, spatial transformer layers), weight initialization approaches (e.g., Gaussian, uniform, and orthogonal random initialization, unsupervised pre-training, Xavier initialization, Rectified Linear Unit (ReLU) aware scaled initialization, and supervised pre-training), regularization approaches (e.g., data augmentation, dropout, drop-connect, batch normalization, ensemble averaging, the ℓ1 and ℓ2 regularization, elastic net, max-norm constraint, early stopping), and several loss functions (e.g., soft-max, SVM hinge, squared hinge,
Euclidean, contrastive, and expectation loss).
CHAPTER 5
The CNN training process involves the optimization of its parameters such that the loss function
is minimized. This chapter reviews well-known and popular gradient-based training algorithms
(e.g., batch gradient descent, stochastic gradient descent, mini-batch gradient descent) followed
by state-of-the-art optimizers (e.g., Momentum, Nesterov momentum, AdaGrad, AdaDelta,
RMSprop, Adam) which address the limitations of the gradient descent learning algorithms.
In order to make this book a self-contained guide, this chapter also discusses the different approaches that are used to compute differentials of the most popular CNN layers which are employed to train CNNs using the error back-propagation algorithm.
CHAPTER 6
This chapter introduces the most popular CNN architectures which are formed using the basic
building blocks studied in Chapter 4 and Chapter 7. Both early CNN architectures which are
easier to understand (e.g., LeNet, NiN, AlexNet, VGGnet) and the recent CNN ones (e.g.,
GoogleNet, ResNet, ResNeXt, FractalNet, DenseNet), which are relatively complex, are presented in detail.
CHAPTER 7
This chapter reviews various applications of CNNs in computer vision, including image classification, object detection, semantic segmentation, scene labeling, and image generation. For each
application, the popular CNN-based models are explained in detail.

CHAPTER 8
Deep learning methods have resulted in significant performance improvements in computer vision applications and, thus, several software frameworks have been developed to facilitate these
implementations. This chapter presents a comparative study of nine widely used deep learning
frameworks, namely Caffe, TensorFlow, MatConvNet, Torch7, Theano, Keras, Lasagne, Marvin, and Chainer, on different aspects. This chapter helps the readers to understand the main
features of these frameworks (e.g., the provided interface and platforms for each framework)
and, thus, the readers can choose the one which suits their needs best.

CHAPTER 2

Features and Classifiers
Feature extraction and classification are two key stages of a typical computer vision system. In
this chapter, we provide an introduction to these two steps: their importance and their design
challenges for computer vision tasks.
Feature extraction methods can be divided into two different categories, namely hand-engineering-based methods and feature learning-based methods. Before going into the details of the feature learning algorithms in the subsequent chapters (i.e., Chapter 3, Chapter 4, Chapter 5, and Chapter 6), we introduce in this chapter some of the most popular traditional hand-engineered features (e.g., HOG [Triggs and Dalal, 2005], SIFT [Lowe, 2004], SURF [Bay et al., 2008]), and their limitations in detail.
Classifiers can be divided into two groups, namely shallow and deep models. This
chapter also introduces some well-known traditional classifiers (e.g., SVM [Cortes, 1995],
RDF [Breiman, 2001, Quinlan, 1986]), which have a single learned layer and are therefore
shallow models. The subsequent chapters (i.e., Chapter 3, Chapter 4, Chapter 5, and Chapter 6) cover the deep models, including CNNs, which have multiple hidden layers and, thus,
can learn features at various levels of abstraction.

2.1 IMPORTANCE OF FEATURES AND CLASSIFIERS

The accuracy, robustness, and efficiency of a vision system are largely dependent on the quality
of the image features and the classifiers. An ideal feature extractor would produce an image representation that makes the job of the classifier trivial (see Fig. 2.1). Conversely, unsophisticated
feature extractors require a “perfect” classifier to adequately perform the pattern recognition task. However, ideal feature extraction and perfect classification performance are often impossible. Thus, the goal is to extract informative and reliable features from the input images, in
order to enable the development of a largely domain-independent theory of classification.

2.1.1 FEATURES
A feature is any distinctive aspect or characteristic which is used to solve a computational task
related to a certain application. For example, given a face image, there is a variety of approaches
to extract features, e.g., mean, variance, gradients, edges, geometric features, color features, etc.
The combination of n features can be represented as an n-dimensional vector, called a feature vector. The quality of a feature vector is dependent on its ability to discriminate image
samples from different classes.
Figure 2.1: (a) The aim is to design an algorithm which classifies input images into two different
categories: “Car” or “non-Car.” (b) Humans can easily see the car and categorize this image
as “Car.” However, computers see pixel intensity values as shown in (c) for a small patch in the
image. Computer vision methods process all pixel intensity values and classify the image. (d) The
straightforward way is to feed the intensity values to the classifiers and the learned classifier will
then perform the classification job. For better visualization, let us pick only two pixels, as shown
in (e). Because pixel 1 is relatively bright and pixel 2 is relatively dark, that image has a position
shown in blue plus sign in the plot shown in (f ). By adding few positive and negative samples,
the plot in (g) shows that the positive and negative samples are extremely jumbled together. So
if this data is fed to a linear classifier, the subdivision of the feature space into two classes is
not possible. (h) It turns out that a proper feature representation can overcome this problem.
For example, using more informative features such as the number of wheels in the images, the
number of doors in the images, the data looks like (i) and the images become much easier to
classify.

Image samples from the same class should have similar feature values and images from different classes should have different feature values. For the example
shown in Fig. 2.1, all cars shown in Fig. 2.2 should have similar feature vectors, irrespective of
their models, sizes, positions in the images, etc. Thus, a good feature should be informative,
invariant to noise and a set of transformations (e.g., rotation and translation), and fast to compute. For instance, features such as the number of wheels in the images, the number of doors
in the images could help to classify the images into two different categories, namely “car” and

“non-car.” However, extracting such features is a challenging problem in computer vision and
machine learning.

Figure 2.2: Images of different classes of cars captured from different scenes and viewpoints.

2.1.2 CLASSIFIERS
Classification is at the heart of modern computer vision and pattern recognition. The task of the
classifier is to use the feature vector to assign an image or region of interest (RoI) to a category.
The degree of difficulty of the classification task depends on the variability in the feature values of
images from the same category, relative to the difference between feature values of images from
different categories. However, a perfect classification performance is often impossible. This is
mainly due to the presence of noise (in the form of shadows, occlusions, perspective distortions,
etc.), outliers (e.g., an image from the “buildings” category might also contain people, animals, or cars), ambiguity (e.g., the same rectangular shape could correspond to a table or a
building window), the lack of labels, the availability of only small training samples, and the
imbalance of positive/negative coverage in the training data samples. Thus, designing a classifier
to make the best decision is a challenging task.

2.2 TRADITIONAL FEATURE DESCRIPTORS

Traditional (hand-engineered) feature extraction methods can be divided into two broad categories: global and local. The global feature extraction methods define a set of global features
which effectively describe the entire image. Thus, the shape details are ignored. The global features are also not suitable for the recognition of partially occluded objects. On the other hand,
the local feature extraction methods extract a local region around keypoints and, thus, can handle occlusion better [Bayramoglu and Alatan, 2010, Rahmani et al., 2014]. On that basis, the
focus of this chapter is on local features/descriptors.

Various methods have been developed for detecting keypoints and constructing descriptors around them. For instance, local descriptors, such as HOG [Triggs and Dalal, 2005], SIFT
[Lowe, 2004], SURF [Bay et al., 2008], FREAK [Alahi et al., 2012], ORB [Rublee et al., 2011],
BRISK [Leutenegger et al., 2011], BRIEF [Calonder et al., 2010], and LIOP [Wang et al.,
2011b] have been used in most computer vision applications. The considerable recent progress
that has been achieved in the area of recognition is largely due to these features, e.g., optical flow
estimation methods use orientation histograms to deal with large motions; image retrieval and
structure from motion are based on SIFT descriptors. It is important to note that CNNs, which
will be discussed in Chapter 4, are not that different from traditional hand-engineered features.
The first layers in a CNN learn to exploit gradients in a way that is similar to hand-engineered
features such as HOG, SIFT, and SURF. In order to gain a better understanding of CNNs, we
next describe three important and widely used feature detectors and/or descriptors, namely
HOG [Triggs and Dalal, 2005], SIFT [Lowe, 2004], and SURF [Bay et al., 2008], in some
detail. As you will see in Chapter 4, CNNs are also able to extract similar hand-engineered
features (e.g., gradients) in their lower layers but through an automatic feature learning process.

2.2.1 HISTOGRAM OF ORIENTED GRADIENTS (HOG)
HOG [Triggs and Dalal, 2005] is a feature descriptor that is used to automatically detect objects from images. The HOG descriptor encodes the distribution of directions of gradients in
localized portions of an image.
HOG features were introduced by Triggs and Dalal [2005], who studied the
influence of several variants of the HOG descriptor (R-HOG and C-HOG) with different gradient computation and normalization methods. The idea behind the HOG descriptor is that
the object appearance and shape within an image can be described by the histogram of edge
directions. The implementation of this descriptor consists of the following four steps.
Gradient Computation
The first step is the computation of the gradient values. A 1D centered point discrete derivative
mask is applied on an image in both the horizontal and vertical directions. Specifically, this
method requires the filtering of the gray-scale image with the following filter kernels:

f_x = [-1 \ 0 \ +1] and f_y = [-1 \ 0 \ +1]^T.   (2.1)

Thus, given an image I, the following convolution operations (denoted by *) result in the
derivatives of the image I in the x and y directions:

I_x = I * f_x and I_y = I * f_y.   (2.2)

Thus, the orientation \theta and the magnitude |g| of the gradient are calculated as follows:

\theta = \arctan\frac{I_y}{I_x} and |g| = \sqrt{I_x^2 + I_y^2}.   (2.3)

As you will see in Chapter 4, just like the HOG descriptor, CNNs also use convolution
operations in their layers. However, the main difference is that instead of using hand-engineered
filters, e.g., fx ; fy in Eq. (2.1), CNNs use trainable filters which make them highly adaptive.
That is why they can achieve high accuracy levels in most applications such as image recognition.
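
To make the gradient computation step concrete, the following minimal sketch (not taken from the original HOG implementation) computes Eqs. (2.1)–(2.3) for a gray-scale image stored as a NumPy array; the use of SciPy's convolve and of arctan2 (a numerically safe form of the arctangent in Eq. (2.3)) are implementation choices, not part of the original method.

# Minimal sketch of the HOG gradient step in Eqs. (2.1)-(2.3).
import numpy as np
from scipy.ndimage import convolve

def hog_gradients(image):
    """Return gradient orientation (radians) and magnitude per pixel."""
    image = image.astype(np.float64)
    fx = np.array([[-1.0, 0.0, 1.0]])        # horizontal derivative kernel, Eq. (2.1)
    fy = fx.T                                 # vertical derivative kernel
    Ix = convolve(image, fx, mode='nearest')  # I * f_x, Eq. (2.2)
    Iy = convolve(image, fy, mode='nearest')  # I * f_y
    theta = np.arctan2(Iy, Ix)                # orientation, Eq. (2.3) (arctan2 avoids division by zero)
    magnitude = np.sqrt(Ix**2 + Iy**2)        # gradient magnitude |g|
    return theta, magnitude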
Cell Orientation Histogram
The second step is the calculation of the cell histograms. First, the image is divided into small
(usually 8 × 8 pixel) cells. Each cell has a fixed number of gradient orientation bins, which are
evenly spread over 0–180° or 0–360°, depending on whether the gradient is unsigned or signed.
Each pixel within the cell casts a weighted vote for a gradient orientation bin based on the
gradient magnitude at that pixel. For the vote weight, the pixel contribution can be the gradient
magnitude, the square root of the gradient magnitude, or the square of the gradient magnitude.
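
The cell histogram step can be sketched as follows; the 8 × 8 cell size, the 9 unsigned orientation bins over 0–180°, and the magnitude-weighted votes are assumed choices that follow a common HOG configuration, not necessarily the exact settings of Triggs and Dalal [2005].

# Minimal sketch of the cell orientation histograms (unsigned gradients assumed).
import numpy as np

def cell_histograms(theta, magnitude, cell_size=8, n_bins=9):
    """Histogram of oriented gradients for each cell of an image."""
    angles = np.rad2deg(theta) % 180.0        # map orientations to [0, 180) degrees
    h, w = theta.shape
    n_cy, n_cx = h // cell_size, w // cell_size
    hist = np.zeros((n_cy, n_cx, n_bins))
    bin_width = 180.0 / n_bins
    for cy in range(n_cy):
        for cx in range(n_cx):
            ys = slice(cy * cell_size, (cy + 1) * cell_size)
            xs = slice(cx * cell_size, (cx + 1) * cell_size)
            bins = (angles[ys, xs] // bin_width).astype(int) % n_bins
            # Each pixel votes for its orientation bin, weighted by its magnitude.
            np.add.at(hist[cy, cx], bins.ravel(), magnitude[ys, xs].ravel())
    return hist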
Block Descriptor
To deal with changes in illumination and contrast, the gradient strengths are locally normalized by grouping the cells together into larger, spatially connected blocks. The HOG descriptor
is then the vector of the components of the normalized cell histograms from all of the block
regions.
Block Normalization
The final step is the normalization of the block descriptors. Let v be the non-normalized vector
containing all histograms in a given block, \|v\|_k be its k-norm for k = 1, 2, and \epsilon be a small
constant. Then the normalization factor can be one of the following:

L2-norm: v \rightarrow \frac{v}{\sqrt{\|v\|_2^2 + \epsilon^2}},   (2.4)

or

L1-norm: v \rightarrow \frac{v}{\|v\|_1 + \epsilon},   (2.5)

or

L1-sqrt: v \rightarrow \sqrt{\frac{v}{\|v\|_1 + \epsilon}}.   (2.6)

There is another normalization factor, L2-Hys, which is obtained by clipping the L2-norm
of v (i.e., limiting the maximum values of v to 0.2) and then re-normalizing.
The final image/RoI descriptor is formed by concatenating all normalized block descriptors. The experimental results in Triggs and Dalal [2005] show that all four block normalization
methods achieve a very significant improvement over the non-normalized one. Moreover, the
L2-norm, L2-Hys, and L1-sqrt normalization approaches provide a similar performance, while
the L1-norm provides a slightly less reliable performance.
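
A minimal sketch of the block descriptor and the L2-norm normalization in Eq. (2.4) is given below; the 2 × 2 cells per block and the epsilon value are illustrative assumptions, and the cell histograms are taken from the previous sketch.

# Minimal sketch of block grouping and L2-norm block normalization, Eq. (2.4).
import numpy as np

def block_descriptors(hist, block_size=2, eps=1e-5):
    n_cy, n_cx, n_bins = hist.shape
    blocks = []
    for by in range(n_cy - block_size + 1):
        for bx in range(n_cx - block_size + 1):
            v = hist[by:by + block_size, bx:bx + block_size].ravel()
            v = v / np.sqrt(np.sum(v**2) + eps**2)   # L2-norm, Eq. (2.4)
            blocks.append(v)
    # The final HOG descriptor is the concatenation of all normalized blocks.
    return np.concatenate(blocks)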

Figure 2.3: HOG descriptor (panels: original image, gradient, cell orientation histogram, and
block descriptor). Note that for better visualization, we only show the cell orientation histogram
for four cells and a block descriptor corresponding to those four cells.

2.2.2 SCALE-INVARIANT FEATURE TRANSFORM (SIFT)
SIFT [Lowe, 2004] provides a set of features of an object that are robust to object
scaling and rotation. The SIFT algorithm consists of four main steps, which are discussed in
the following subsections.
Scale-space Extrema Detection
The first step aims to identify potential keypoints that are invariant to scale and orientation.
While several techniques can be used to detect keypoint locations in the scale-space, SIFT uses
the Difference of Gaussians (DoG), which is obtained as the difference between Gaussian blurrings
of an image at two different scales, \sigma and k\sigma (i.e., one scale is k times the other).
This process is performed for different octaves of the image in the Gaussian pyramid, as shown
in Fig. 2.4a.
Then, the DoG images are searched for local extrema over all scales and image locations.
For instance, a pixel in an image is compared with its eight neighbors in the current image as
well as the nine neighbors each in the scale above and below, as shown in Fig. 2.4b. If it is the
minimum or maximum of all these neighbors, then it is a potential keypoint, which means that
the keypoint is best represented at that scale.
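
The following sketch builds one octave of a DoG pyramid by subtracting adjacent Gaussian blurs; it is only illustrative (the base scale sigma = 1.6, the factor k = sqrt(2), and the number of levels are assumptions, not the exact SIFT settings).

# Minimal sketch of one octave of a Difference-of-Gaussians pyramid.
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(image, sigma=1.6, k=2 ** 0.5, n_levels=5):
    blurred = [gaussian_filter(image.astype(np.float64), sigma * k ** i)
               for i in range(n_levels)]
    # Adjacent Gaussian levels are subtracted to form the DoG images.
    return [blurred[i + 1] - blurred[i] for i in range(n_levels - 1)]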

Figure 2.4: Scale-space feature detection using a sub-octave DoG pyramid. (a) Adjacent levels
of a sub-octave Gaussian pyramid are subtracted to produce the DoG; and (b) extrema in the
resulting 3D volume are detected by comparing a pixel to its 26 neighbors. (Figure from Lowe
[2004], used with permission.)

Accurate Keypoint Localization
This step removes unstable points from the list of potential keypoints by finding those that have
low contrast or are poorly localized on an edge. In order to reject low contrast keypoints, a Taylor
series expansion of the scale space is computed to get more accurate locations of the extrema, and if
the intensity at an extremum is less than a threshold value, the keypoint is rejected.
Moreover, the DoG function has a strong response along the edges, which results in a large
principal curvature across the edge but a small curvature in the perpendicular direction in the
DoG function. In order to remove the keypoints located on an edge, the principal curvature at
the keypoint is computed from a 2 × 2 Hessian matrix at the location and scale of the keypoint.
If the ratio between the first and the second eigenvalues is greater than a threshold, the keypoint
is rejected.

Remark: In mathematics, the Hessian matrix or Hessian is a square matrix
of second-order partial derivatives of a scalar-valued function. Specifically,
suppose f(x_1, x_2, \ldots, x_n) is a function outputting a scalar, i.e., f: \mathbb{R}^n \to \mathbb{R};
if all the second partial derivatives of f exist and are continuous over the
domain of the function, then the Hessian H of f is a square n \times n matrix,
defined as follows:

H = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}.   (2.7)

Orientation Assignment
In order to achieve invariance to image rotation, a consistent orientation is assigned to each
keypoint based on its local image properties. The keypoint descriptor can then be represented
relative to this orientation. The algorithm used to find an orientation consists of the following
steps.

1. The scale of the keypoint is used to select the Gaussian blurred image with the closest
scale.
2. The gradient magnitude and orientation are computed for each image pixel at this scale.
3. As shown in Fig. 2.5, an orientation histogram, which consists of 36 bins covering the
360° range of orientations, is built from the gradient orientations of pixels within a local
region around the keypoint.
4. The highest peak in the local orientation histogram corresponds to the dominant direction
of the local gradients. Moreover, any other local peak that is within 80% of the highest
peak is also considered as a keypoint with that orientation.
Keypoint Descriptor
The dominant direction (the highest peak in the histogram) of the local gradients is also used to
create keypoint descriptors. The gradient orientations are rotated relative to the orientation of
the keypoint and then weighted by a Gaussian with a variance of 1.5 × the keypoint scale. Then, a
16 × 16 neighborhood around the keypoint is divided into 16 sub-blocks of size 4 × 4. For each
sub-block, an 8-bin orientation histogram is created. This results in a feature vector, called the
SIFT descriptor, containing 128 elements. Figure 2.6 illustrates the SIFT descriptors for keypoints
extracted from an example image.
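
In practice, an off-the-shelf SIFT implementation can be used to obtain the keypoints and 128-D descriptors described above. The snippet below assumes OpenCV version 4.4 or later is installed (as opencv-python) and that an example image is available at the placeholder path.

# Illustrative use of an off-the-shelf SIFT implementation (OpenCV >= 4.4 assumed).
import cv2

image = cv2.imread('example.jpg', cv2.IMREAD_GRAYSCALE)   # placeholder image path
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)
print(len(keypoints), descriptors.shape)    # N keypoints and an (N, 128) descriptor matrix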
Figure 2.5: A dominant orientation estimate is computed by creating a histogram of all the
gradient orientations (over the 0 to 2π range) weighted by their magnitudes and then finding
the significant peaks in this distribution.

Figure 2.6: An example of the SIFT detector and descriptor: (left) an input image, (middle)
some of the detected keypoints with their corresponding scales and orientations, and (right)
SIFT descriptors: a 16 × 16 neighborhood around each keypoint is divided into 16 sub-blocks
of 4 × 4 size.

Complexity of SIFT Descriptor
In summary, SIFT tries to standardize all images (if the image is blown up, SIFT shrinks it;
if the image is shrunk, SIFT enlarges it). This corresponds to the idea that if a keypoint can
be detected in an image at scale \sigma, then we would need a larger scale k\sigma to capture the
same keypoint if the image were up-scaled. However, the mathematical ideas behind SIFT and many
other hand-engineered features are quite complex and require many years of research. For example, Lowe [2004] spent almost 10 years on the design and tuning of the SIFT parameters.
As we will show in Chapters 4, 5, and 6, CNNs also perform a series of transformations on the
image by incorporating several convolutional layers. However, unlike SIFT, CNNs learn these
transformations (e.g., scale, rotation, translation) from image data, without the need for complex
mathematical ideas.

2.2.3 SPEEDED-UP ROBUST FEATURES (SURF)
SURF [Bay et al., 2008] is a speeded up version of SIFT. In SIFT, the Laplacian of Gaussian is
approximated with the DoG to construct a scale-space. SURF speeds up this process by
approximating the LoG with a box filter. Thus, a convolution with a box filter can easily be computed
with the help of integral images and can be performed in parallel for different scales.
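
The following sketch (not from the SURF implementation) illustrates why box filters are cheap: once an integral image has been built, the sum over any rectangular region can be obtained with a constant number of operations, independent of the filter size.

# Minimal sketch of an integral image and O(1) box-filter sums.
import numpy as np

def integral_image(img):
    return np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] computed from the integral image ii (end indices exclusive)."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total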
Keypoint Localization
In the first step, a blob detector based on the Hessian matrix is used to localize keypoints. The
determinant of the Hessian matrix is used to select both the location and scale of the potential
keypoints. More precisely, for an image I with a given point p = (x, y), the Hessian matrix
H(p, \sigma) at point p and scale \sigma is defined as follows:

H(p, \sigma) = \begin{bmatrix} L_{xx}(p, \sigma) & L_{xy}(p, \sigma) \\ L_{xy}(p, \sigma) & L_{yy}(p, \sigma) \end{bmatrix},   (2.8)

where L_{xx}(p, \sigma) is the convolution of the second-order derivative of the Gaussian,
\frac{\partial^2}{\partial x^2} g(\sigma), with the image I at point p. However, instead of using Gaussian
filters, SURF uses approximated Gaussian second-order derivatives, which can be evaluated using
integral images at a very low computational cost. Thus, unlike SIFT, SURF does not need to
iteratively apply the same filter to the output of a previously filtered layer; scale-space analysis is
done by keeping the image the same and varying the filter size, i.e., 9 × 9, 15 × 15, 21 × 21, and 27 × 27.
Then, a non-maximum suppression in a 3 × 3 × 3 neighborhood of each point in the image
is applied to localize the keypoints in the image. The maxima of the determinant of the Hessian
matrix are then interpolated in the scale and image space, using the method proposed by Brown
and Lowe [2002].

Orientation Assignment
In order to achieve rotational invariance, the Haar wavelet responses in both the horizontal
x and vertical y directions within a circular neighborhood of radius 6s around the keypoint
are computed, where s is the scale at which the keypoint is detected. Then, the Haar wavelet
responses in both the horizontal dx and vertical dy directions are weighted with a Gaussian
centered at a keypoint, and represented as points in a 2D space. The dominant orientation of
the keypoint is estimated by computing the sum of all the responses within a sliding orientation
window of angle 60°. The horizontal and vertical responses within the window are then summed.
The two summed responses are considered as a local orientation vector. The longest orientation
vector over all the windows determines the orientation of the keypoint. In order to achieve a
balance between robustness and angular resolution, the size of the sliding window needs to be
chosen carefully.
Keypoint Descriptor
To describe the region around each keypoint p, a 20s × 20s square region around p is extracted
and then oriented along the orientation of p. The normalized orientation region around p is
split into smaller 4 × 4 square sub-regions. The Haar wavelet responses in both the horizontal
dx and vertical dy directions are extracted at 5 × 5 regularly spaced sample points for each
sub-region. In order to achieve more robustness to deformations, noise, and translation, the Haar
wavelet responses are weighted with a Gaussian. Then, dx and dy are summed up over each
sub-region and the results form the first set of entries in the feature vector. The sums of the
absolute values of the responses, |dx| and |dy|, are also computed and then added to the feature
vector to encode information about the intensity changes. Since each sub-region has a 4D feature
vector, concatenating all 4 × 4 sub-regions results in a 64D descriptor.

2.2.4 LIMITATIONS OF TRADITIONAL HAND-ENGINEERED FEATURES
Until recently, progress in computer vision was based on hand-engineering features. However,
feature engineering is difficult, time-consuming, and requires expert knowledge on the problem
domain. The other issue with hand-engineered features such as HOG, SIFT, SURF, or other
algorithms like them, is that they are too sparse in terms of information that they are able to
capture from an image. This is because the first-order image derivatives are not sufficient features
for the purpose of most computer vision tasks such as image classification and object detection.
Moreover, the choice of features often depends on the application. More precisely, these features
do not facilitate learning from previously learned representations (transfer learning). In
addition, the design of hand-engineered features is limited by the complexity that humans can
put into it. All these issues are resolved by automatic feature learning algorithms such as deep
neural networks, which will be addressed in the subsequent chapters (i.e., Chapters 3, 4, 5, and
6).

2.3 MACHINE LEARNING CLASSIFIERS

Machine learning is usually divided into three main areas, namely supervised, unsupervised, and
semi-supervised. In the case of the supervised learning approach, the goal is to learn a mapping
from inputs to outputs, given a labeled set of input-output pairs. The second type of machine
learning is the unsupervised learning approach, where we are only given inputs, and the goal is to
automatically find interesting patterns in the data. This problem is not a well-defined problem,
because we are not told what kind of patterns to look for. Moreover, unlike supervised learning,
where we can compare our label prediction for a given sample to the observed value, there is
no obvious error metric to use. The third type of machine learning is semi-supervised learning,
which typically combines a small amount of labeled data with a large amount of unlabeled data
to generate an appropriate function or classifier. Labeling a large corpus of data is often infeasible
due to its cost, whereas the acquisition of unlabeled data is relatively inexpensive. In such
cases, the semi-supervised learning approach can be of great practical value.
Another important class of machine learning algorithms is “reinforcement learning,”
where the algorithm allows agents to automatically determine the ideal behavior given an observation of the world. Every agent has some impact on the environment, and the environment
provides reward feedback to guide the learning algorithm. However, in this book our focus is
mainly on the supervised learning approach, which is the most widely used machine learning
approach in practice.
A wide range of supervised classification techniques has been proposed in the literature.
These methods can be divided into three different categories, namely linear (e.g., SVM [Cortes,
1995]; logistic regression; Linear Discriminant Analysis (LDA) [Fisher, 1936]), nonlinear (e.g.,
Multi Layer Perceptron (MLP), kernel SVM), and ensemble-based (e.g., RDF [Breiman, 2001,
Quinlan, 1986]; AdaBoost [Freund and Schapire, 1997]) classifiers. The goal of ensemble methods is to combine the predictions of several base classifiers to improve generalization over a
single classifier. The ensemble methods can be divided into two categories, namely averaging
(e.g., Bagging methods; Random Decision Forests [Breiman, 2001, Quinlan, 1986]) and boosting (e.g., AdaBoost [Freund and Schapire, 1997]; Gradient Tree Boosting [Friedman, 2000]).
In the case of the averaging methods, the aim is to build several classifiers independently and
then to average their predictions. For the boosting methods, base “weak” classifiers are built sequentially and one tries to reduce the bias of the combined overall classifier. The motivation is
to combine several weak models to produce a powerful ensemble.
Our definition of a machine learning classifier that is capable of improving computer
vision tasks via experience is somewhat abstract. To make this more concrete, in the following
we describe three widely used linear (SVM), nonlinear (kernel SVM) and ensemble (RDF)
classifiers in some detail.

2.3.1 SUPPORT VECTOR MACHINE (SVM)
SVM [Cortes, 1995] is a supervised machine learning algorithm used for classification or regression problems. SVM works by finding a linear hyperplane which separates the training dataset
into two classes. As there are many such linear hyperplanes, the SVM algorithm tries to find
the optimal separating hyperplane (as shown in Fig. 2.7) which is intuitively achieved when the
distance (also known as the margin) to the nearest training data samples is as large as possible.
This is because, in general, the larger the margin, the lower the generalization error of the model.
Mathematically, SVM is a maximum-margin linear model. Given a training dataset of n
samples of the form \{(x_1, y_1), \ldots, (x_n, y_n)\}, where x_i is an m-dimensional feature vector and
y_i \in \{1, -1\} is the class to which the sample x_i belongs, the goal of SVM is to find the
maximum-margin hyperplane which divides the group of samples for which y_i = 1 from the
group of samples for which y_i = -1. As shown in Fig. 2.7b (the bold blue line), this hyperplane
can be written as the set of sample points satisfying the following equation:

w^T x_i + b = 0,   (2.9)

where w is the normal vector to the hyperplane. More precisely, any sample above the hyperplane
should have label 1, i.e., any x_i s.t. w^T x_i + b > 0 will have the corresponding y_i = 1. Similarly,
any sample below the hyperplane should have label -1, i.e., any x_i s.t. w^T x_i + b < 0 will have
the corresponding y_i = -1.

Notice that there is some space between the hyperplane (or decision boundary, which is
the bold blue line in Fig. 2.7b) and the nearest data samples of either class. Thus, the sample
data is rescaled such that anything on or above the hyperplane w^T x_i + b = 1 is of one class with
label 1, and anything on or below the hyperplane w^T x_i + b = -1 is of the other class with label
-1. Since these two new hyperplanes are parallel, the distance between them is \frac{2}{\sqrt{w^T w}}, as shown
in Fig. 2.7c.
Recall that SVM tries to maximize the distance between these two new hyperplanes demarcating
the two classes, which is equivalent to minimizing \frac{w^T w}{2}. Thus, SVM is learned by solving
the following primal optimization problem:

\min_{w, b} \frac{w^T w}{2}   (2.10)

subject to:

y_i (w^T x_i + b) \geq 1 \quad (\forall \text{ samples } x_i).   (2.11)

Soft-margin Extension
In the case where the training samples are not perfectly linearly separable, SVM can allow some
samples of one class to appear on the other side of the hyperplane (boundary) by introducing
a slack variable \xi_i for each sample x_i, as follows:

\min_{w, b, \xi} \frac{w^T w}{2} + C \sum_i \xi_i,   (2.12)

subject to:

y_i (w^T x_i + b) \geq 1 - \xi_i \text{ and } \xi_i \geq 0 \quad (\forall \text{ samples } x_i).   (2.13)

Unlike deep neural networks, which will be introduced in Chapter 3, linear SVMs can
only solve problems that are linearly separable, i.e., where the data samples belonging to class
1 can be separated from the samples belonging to class 2 by a hyperplane as shown in Fig. 2.7.
However, in many cases, the data samples are not linearly separable.
Nonlinear Decision Boundary
SVM can be extended to nonlinear classification by projecting the original input space (\mathbb{R}^d) into
a higher-dimensional space (\mathbb{R}^D), where a separating hyperplane can hopefully be found. Thus,
the formulation of the quadratic programming problem is as above (Eq. (2.12) and Eq. (2.13)),
but with all x_i replaced with \phi(x_i), where \phi provides a higher-dimensional mapping:

\min_{w, b, \xi} \frac{w^T w}{2} + C \sum_i \xi_i,   (2.14)

subject to:

y_i (w^T \phi(x_i) + b) \geq 1 - \xi_i \text{ and } \xi_i \geq 0 \quad (\forall \text{ samples } x_i).   (2.15)

Figure 2.7: Panels (a)–(e) show candidate linear separators, the maximum-margin hyperplane
w^T x + b = 0 with the margin hyperplanes w^T x + b = ±1 and margin width 2/\sqrt{w^T w}, a
soft-margin decision boundary with slack variables \xi_1 and \xi_2, and a nonlinear decision boundary.
(Continues.)

Figure 2.7: (Continued.) For two-class, separable training datasets, such as the one shown
in (a), there are many possible linear separators, as shown with the blue lines in (a). Intuitively, a
separating hyperplane (also called a decision boundary) drawn in the middle of the void between
the data samples of the two classes (the bold blue line in (b)) seems better than the ones shown
in (a). SVM defines the criterion for a decision boundary that is maximally far away from any
data point. This distance from the decision surface to the closest data point determines the
margin of the classifier, as shown in (b). In the hard-margin SVM, (b), a single outlier can
determine the decision boundary, which makes the classifier overly sensitive to noise in the
data. However, a soft-margin SVM classifier, shown in (c), allows some samples of each class
to appear on the other side of the decision boundary by introducing a slack variable \xi_i for each
sample. (d) shows an example where the classes are not separable by a linear decision boundary.
Thus, as shown in (e), the original input space \mathbb{R}^2 is projected onto \mathbb{R}^3, where a linear decision
boundary can be found, i.e., using the kernel trick.
Dual Form of SVM
In the case that D \gg d, there are many more parameters to learn for w. In order to avoid this,
the dual form of SVM is used for the optimization problem:

\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j),   (2.16)

subject to:

\sum_i \alpha_i y_i = 0 \text{ and } 0 \leq \alpha_i \leq C,   (2.17)

where C is a hyper-parameter which controls the degree of misclassification of the model, in
case classes are not linearly separable.
Kernel Trick
Since \phi(x_i) lies in a high-dimensional (possibly infinite-dimensional) space, calculating
\phi(x_i)^T \phi(x_j) may be intractable. However, there are special kernel functions, such as the linear,
polynomial, Gaussian, and Radial Basis Function (RBF) kernels, which operate on the lower-dimensional
vectors x_i and x_j to produce a value equivalent to the dot-product of the higher-dimensional vectors. For
example, consider the function \phi: \mathbb{R}^3 \to \mathbb{R}^{10}, where

\phi(x) = (1, \sqrt{2}x^{(1)}, \sqrt{2}x^{(2)}, \sqrt{2}x^{(3)}, [x^{(1)}]^2, [x^{(2)}]^2, [x^{(3)}]^2, \sqrt{2}x^{(1)}x^{(2)}, \sqrt{2}x^{(1)}x^{(3)}, \sqrt{2}x^{(2)}x^{(3)}).   (2.18)

It is interesting to note that for the given function \phi in Eq. (2.18),

K(x_i, x_j) = (1 + x_i^T x_j)^2 = \phi(x_i)^T \phi(x_j).   (2.19)
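
The equality in Eq. (2.19) can be checked numerically with the small sketch below; the input vectors are arbitrary illustrative values.

# Numerical check that the polynomial kernel equals the explicit feature dot-product.
import numpy as np

def phi(x):
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([1, s*x1, s*x2, s*x3, x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([0.5, -1.0, 2.0])
print((1 + xi @ xj) ** 2)    # kernel value K(xi, xj)
print(phi(xi) @ phi(xj))     # same value via the explicit mapping phi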

Thus, instead of calculating \phi(x_i)^T \phi(x_j), the polynomial kernel function K(x_i, x_j) =
(1 + x_i^T x_j)^2 is calculated to produce a value equivalent to the dot-product of the
higher-dimensional vectors, \phi(x_i)^T \phi(x_j).
Note that the dual optimization problem is exactly the same, except that the dot product
\phi(x_i)^T \phi(x_j) is replaced by a kernel K(x_i, x_j), which corresponds to the dot product of \phi(x_i)
and \phi(x_j) in the new space:

\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j),   (2.20)

subject to:

\sum_i \alpha_i y_i = 0 \text{ and } 0 \leq \alpha_i \leq C.   (2.21)

In summary, linear SVMs can be thought of as single-layer classifiers, while kernel SVMs
can be thought of as two-layer neural networks. However, unlike SVMs, Chapter 3 shows that
deep neural networks are typically built by stacking several nonlinear hidden layers, and
thus can extract more complex patterns from data samples.
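
For completeness, the sketch below shows how linear and kernel (RBF) SVMs can be trained with an off-the-shelf library; scikit-learn and its toy two-moons dataset are assumed here purely for illustration and are not part of the original discussion.

# Illustrative use of linear and kernel SVMs (scikit-learn assumed available).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
linear_svm = SVC(kernel='linear', C=1.0).fit(X, y)           # linear decision boundary
rbf_svm = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)  # kernel trick via RBF kernel
print('linear accuracy:', linear_svm.score(X, y))
print('rbf accuracy:', rbf_svm.score(X, y))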

2.3.2 RANDOM DECISION FOREST
A random decision forest [Breiman, 2001, Quinlan, 1986] is an ensemble of decision trees.
As shown in Fig. 2.8a, each tree consists of split and leaf nodes. A split node performs binary
classification based on the value of a particular feature of the features vector. If the value of the
particular feature is less than a threshold, then the sample is assigned to the left partition, else to
the right partition. Figure 2.8b shows an illustrative decision tree used to figure out whether a
photo represents and indoor or an outdoor scene. If the classes are linearly separable, after log2 .c/
decisions each sample class will get separated from the remaining c 1 classes and reach a leaf
node. For a given feature vector f, each tree predicts independently its label and a majority voting
scheme is used to predict the final label of the feature vector. It has been shown that random
decision forests are fast and effective multi-class classifiers [Shotton et al., 2011].
Training
Each tree is trained on a randomly selected subset of the training data (usually 2/3 of the
training samples). The remaining samples of the training data are used for validation. A subset of
features is randomly selected for each split node. Then, we search for the best feature f[i] and an
associated threshold \theta_i that maximize the information gain of the training data after partitioning.
Let H(Q) be the original entropy of the training data and H(Q|\{f[i], \theta_i\}) the entropy of Q after
partitioning it into the “left” and “right” partitions, Q_l and Q_r. The information gain, G, is given by

G(Q|\{f[i], \theta_i\}) = H(Q) - H(Q|\{f[i], \theta_i\}),   (2.22)

Figure 2.8: (a) A decision tree is a set of nodes and edges organized in a hierarchical fashion. The
split (or internal) nodes are denoted with circles and the leaf (or terminal) nodes with squares.
(b) A decision tree is a tree where each split node stores a test function which is applied to the
input data (e.g., “Is the top part of the image blue?”, “Is the bottom part green?”). Each leaf stores
the final label (here whether “indoor” or “outdoor”).
where

H(Q|\{f[i], \theta_i\}) = \frac{|Q_l|}{|Q|} H(Q_l) + \frac{|Q_r|}{|Q|} H(Q_r),   (2.23)

and |Q_l| and |Q_r| denote the number of data samples in the left and right partitions. The entropy
of Q_l is given by

H(Q_l) = -\sum_{i \in Q_l} p_i \log_2 p_i,   (2.24)

where p_i is the number of samples of class i in Q_l divided by |Q_l|. The feature and the associated
threshold which maximize the gain are selected as the splitting test for that node:

\{f[i], \theta_i\}^* = \arg\max_{\{f[i], \theta_i\}} G(Q|\{f[i], \theta_i\}).   (2.25)

Entropy and Information Gain: The entropy and information gain are two
important concepts in the RDF training process. These concepts are usually
discussed in information theory or probability courses and are briefly discussed below.
Information entropy is defined as a measure of the randomness in
the information being processed: the higher the entropy, the greater the
uncertainty about the outcome. Mathematically, given a discrete random
variable X with possible values \{x_1, \ldots, x_n\} and a probability mass function
P(X), the entropy H (also called Shannon entropy) can be written as follows:
H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i).   (2.26)

For example, an action such as flipping a coin that has no affinity for
“head” or “tail” provides information that is random (X with possible values
of {“head”, “tail”}). Therefore, Eq. (2.26) can be written as follows:

H(X) = -P(\text{“head”}) \log_2 P(\text{“head”}) - P(\text{“tail”}) \log_2 P(\text{“tail”}).   (2.27)

As shown in Fig. 2.9, this binary entropy function (Eq. (2.27)) reaches
its maximum value (uncertainty is at a maximum) when the probability is
1/2, meaning that P(X = \text{“head”}) = 1/2 or, similarly, P(X = \text{“tail”}) = 1/2. The
entropy function reaches its minimum value (i.e., zero) when the probability is
1 or 0, i.e., with complete certainty, P(X = \text{“head”}) = 1 or P(X = \text{“head”}) = 0,
respectively.
Information gain is defined as the change in information entropy H
from a prior state to a state that takes some information (t ) and can be written
as follows:
G(Q|t) = H(Q) - H(Q|t).   (2.28)
If a partition consists of only a single class, it is considered as a leaf node. Partitions
consisting of multiple classes are further partitioned until either they contain single classes or
the tree reaches its maximum height. If the maximum height of the tree is reached and some
of its leaf nodes contain labels from multiple classes, the empirical distribution over the classes
associated with the subset of the training samples v, which have reached that leaf, is used as
its label. Thus, the probabilistic leaf predictor model for the t-th tree is p_t(c|v), where c \in \{c_k\}
denotes the class.
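
A small sketch of the entropy and information gain computations in Eqs. (2.22)–(2.24) for a candidate split is given below; the labels and the split mask are illustrative values.

# Minimal sketch of entropy and information gain for a candidate split.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, left_mask):
    left, right = labels[left_mask], labels[~left_mask]
    # Weighted child entropy as in Eq. (2.23), then the gain of Eq. (2.22).
    h_split = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - h_split

labels = np.array([0, 0, 0, 1, 1, 1, 1, 1])
left_mask = np.array([True, True, True, True, False, False, False, False])
print(information_gain(labels, left_mask))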
Classification
Once a set of decision trees has been trained, given a previously unseen sample x_j, each decision
tree hierarchically applies a number of predefined tests. Starting at the root, each split node
applies its associated split function to x_j. Depending on the result of the binary test, the data is
sent to the right or the left child. This process is repeated until the data point reaches a leaf node.
Usually the leaf nodes contain a predictor (e.g., a classifier) which associates an output (e.g., a
class label) with the input x_j. In the case of forests, many tree predictors are combined together to

Figure 2.9: Entropy vs. probability for a two class variable.
form a single forest prediction:
p(c|x_j) = \frac{1}{T} \sum_{t=1}^{T} p_t(c|x_j),   (2.29)

where T denotes the number of decision trees in the forest.
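
As an illustration, an off-the-shelf random forest, which averages per-tree class posteriors as in Eq. (2.29), can be used as follows; scikit-learn and the synthetic dataset are assumptions made only for this example.

# Illustrative use of an off-the-shelf random decision forest (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(forest.predict_proba(X[:2]))   # averaged class posteriors p(c|x), Eq. (2.29)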

2.4 CONCLUSION

Traditional computer vision systems consist of two steps: feature design and learning algorithm
design, both of which are largely independent. Thus, computer vision problems have traditionally been approached by designing hand-engineered features such as HOG [Triggs and Dalal,
2005], SIFT [Lowe, 2004], and SURF [Bay et al., 2008], which do not generalize well to other
domains, are time-consuming and expensive to design, and require expert knowledge of the problem
domain. These feature engineering processes are usually followed by learning algorithms such as
SVM [Cortes, 1995] and RDF [Breiman, 2001, Quinlan, 1986]. However, progress in deep
learning algorithms resolves all these issues by training a deep neural network for feature extraction and classification in an end-to-end learning framework. More precisely, unlike traditional approaches, deep neural networks learn to simultaneously extract features and classify data
samples. Chapter 3 will discuss deep neural networks in detail.

Figure 2.10: RDF classification for a test sample x_j. During testing, the same test sample is
passed through each decision tree. At each internal node a test is applied and the test sample
is sent to the appropriate child. This process is repeated until a leaf is reached. At the leaf the
stored posterior p_t(c|x_j) is read. The forest class posterior p(c|x_j) is the average of all decision
tree posteriors.

CHAPTER 3

Neural Networks Basics

3.1 INTRODUCTION

Before going into the details of the CNNs, we provide in this chapter an introduction to artificial neural networks, their computational mechanism, and their historical background. Neural
networks are inspired by the working of the cerebral cortex in mammals. It is important to note,
however, that these models do not closely resemble the working, scale and complexity of the
human brain. Artificial neural network models can be understood as a set of basic processing
units, which are tightly interconnected and operate on the given inputs to process the information and generate desired outputs. Neural networks can be grouped into two generic categories
based on the way the information is propagated in the network.
• Feed-forward networks
The information flow in a feed-forward network happens only in one direction. If the
network is considered as a graph with neurons as its nodes, the connections between the
nodes are such that there are no loops or cycles in the graph. These network architectures
can be referred to as Directed Acyclic Graphs (DAGs). Examples include MLPs and CNNs,
which we will discuss in detail in the upcoming sections.
• Feed-back networks
As the name implies, feed-back networks have connections which form directed cycles (or
loops). This architecture allows them to operate on and generate sequences of arbitrary
sizes. Feed-back networks exhibit memorization ability and can store information and
sequence relationships in their internal memory. Examples of such architectures include
Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks.
We provide an example architecture for both feed-forward and feed-back networks in
Sections 3.2 and 3.3, respectively. For feed-forward networks, we first study MLP, which is a
simple case of such architectures. In Chapter 4, we will cover the CNNs in detail, which also
work in a feed-forward manner. For feed-back networks, we study RNNs. Since our main focus
here is on CNNs, an in-depth treatment of RNNs is out of the scope of this book. We refer
interested readers to Graves et al. [2012] for RNN details.

3.2 MULTI-LAYER PERCEPTRON

3.2.1 ARCHITECTURE BASICS
Figure 3.1 shows an example of a MLP network architecture which consists of three hidden
layers, sandwiched between an input and an output layer. In simplest terms, the network can
be treated as a black box, which operates on a set of inputs and generates some outputs. We
highlight some of the interesting aspects of this architecture in more detail below.
Figure 3.1: A simple feed-forward neural network with dense connections.
Layered Architecture: Neural networks comprise a hierarchy of processing levels. Each
level is called a “network layer” and consists of a number of processing “nodes” (also called “neurons” or “units”). Typically, the input is fed through an input layer and the final layer is the output
layer which makes predictions. The intermediate layers perform the processing and are referred
to as the hidden layers. Due to this layered architecture, this neural network is called an MLP.
Nodes: The individual processing units in each layer are called the nodes of the neural network architecture. The nodes basically implement an “activation function” which, given an input,
decides whether the node will fire or not.
Dense Connections: The nodes in a neural network are interconnected and can communicate with each other. Each connection has a weight which specifies the strength of the
connection between two nodes. For the simple case of feed-forward neural networks, the information is transferred sequentially in one direction from the input to the output layers. Therefore,
each node in a layer is directly connected to all nodes in the immediate previous layer.

3.2.2 PARAMETER LEARNING
As we described in Section 3.2.1, the weights of a neural network define the connections between
neurons. These weights need to be set appropriately so that a desired output can be obtained from
the neural network. The weights encode the “model” generated from the training data that is used
to allow the network to perform a designated task (e.g., object detection, recognition, and/or
classification). In practical settings, the number of weights is huge which requires an automatic
procedure to update their values appropriately for a given task. The process of automatically
tuning the network parameters is called “learning” which is accomplished during the training
stage (in contrast to the test stage where inference/prediction is made on “unseen data,” i.e., data
that the network has not “seen” during training). This process involves showing examples of the
desired task to the network so that it can learn to identify the right set of relationships between
the inputs and the required outputs. For example, in the paradigm of supervised learning, the
inputs can be media (speech, images) and the outputs are the desired set of “labels” (e.g., identity
of a person) which are used to tune the neural network parameters.
We now describe a basic form of learning algorithm, which is called the Delta Rule.
Delta Rule
The basic idea behind the delta rule is to learn from the mistakes of the neural network during
the training phase. The delta rule was proposed by Widrow et al. [1960]; it updates the
network parameters (i.e., the weights, denoted by \theta, assuming zero biases) based on the difference
between the target output and the predicted output. This difference is calculated in terms of the
Least Mean Square (LMS) error, which is why the delta learning rule is also referred to as the
LMS rule. The output units are a “linear function” of the inputs denoted by x, i.e.,

p_i = \sum_j \theta_{ij} x_j.

If p_n and y_n denote the predicted and target outputs, respectively, the error can be calculated as:

E = \frac{1}{2} \sum_n (y_n - p_n)^2,   (3.1)

where n ranges over the categories in the dataset (i.e., the neurons in the output layer). The delta
rule calculates the gradient of this error function (Eq. (3.1)) with respect to the parameters of the
network, \partial E / \partial \theta_{ij}. Given the gradient, the weights are updated iteratively
according to the following learning rule:

\theta_{ij}^{t+1} = \theta_{ij}^t + \eta \frac{\partial E}{\partial \theta_{ij}},   (3.2)

\theta_{ij}^{t+1} = \theta_{ij}^t + \eta (y_i - p_i) x_j,   (3.3)

where t denotes the previous iteration of the learning process. The hyper-parameter \eta denotes
the step size of the parameter update in the direction of the calculated gradient. One can visualize
that no learning happens when the gradient or the step size is zero. In other cases, the parameters
are updated such that the predicted outputs get closer to the target outputs. After a number of
iterations, the network training process is said to converge when the parameters do not change
any longer as a result of the update.
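
The following minimal sketch implements the delta (LMS) rule of Eqs. (3.2)–(3.3) for a single linear output layer; the toy regression data, the step size, and the number of iterations are illustrative assumptions.

# Minimal sketch of the delta (LMS) rule for a linear output layer.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 input features
true_w = np.array([[1.0], [-2.0], [0.5]])
Y = X @ true_w                            # target outputs
theta = np.zeros((3, 1))                  # weights to learn (zero biases, as in the text)
eta = 0.1                                 # step size

for _ in range(2000):
    P = X @ theta                         # predicted outputs p_i = sum_j theta_ij x_j
    theta += (eta / len(X)) * X.T @ (Y - P)   # theta <- theta + eta (y - p) x, Eq. (3.3)
print(theta.ravel())                      # approaches [1.0, -2.0, 0.5]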

If the step size is too small, the network will take longer to converge and
the learning process will be very slow. On the other hand, taking very large steps can result in
unstable, erratic behavior during the training process, as a result of which the network may not
converge at all. Therefore, setting the step size to the right value is really important for network
training. We will discuss different approaches to set the step size in Section 5.3 for CNN training,
which are equally applicable to MLP.
Generalized Delta Rule
The generalized delta rule is an extension of the delta rule. It was proposed by Rumelhart et al.
[1985]. The delta rule only computes linear combinations between the input and the output
pairs. This limits us to only a single-layered network because a stack of many linear layers is not
better than a single linear transformation. To overcome this limitation, the generalized delta rule
makes use of nonlinear activation functions at each processing unit to model nonlinear relationships between the input and output domains. It also allows us to make use of multiple hidden
layers in the neural network architecture, a concept which forms the heart of deep learning.
The parameters of a multi-layered neural network are updated in the same manner as the
delta rule, i.e.,

\theta_{ij}^{t+1} = \theta_{ij}^t + \eta \frac{\partial E}{\partial \theta_{ij}}.   (3.4)

But unlike the delta rule, the errors are recursively sent backward through the multi-layered
network. For this reason, the generalized delta rule is also called the “back-propagation” algorithm. Since for the case of the generalized delta rule, a neural network not only has an output
layer but also intermediate hidden layers, we can separately calculate the error term (differential
with respect to the desired output) for the output and hidden layers. Since the case of the output
layer is simple, we first discuss the error computation for this layer.
Given the error function in Eq. (3.1), its gradient with respect to the parameters in the
output layer L for each node i can be computed as follows:
\frac{\partial E}{\partial \theta_{ij}^L} = \delta_i^L x_j,   (3.5)

\delta_i^L = (y_i - p_i) f_i'(a_i),   (3.6)

where a_i = \sum_j \theta_{i,j} x_j + b_i is the activation, which is the input to the neuron (prior to the
activation function), the x_j's are the outputs from the previous layer, p_i = f(a_i) is the output from the
neuron (the prediction, for the case of the output layer), f(\cdot) denotes a nonlinear activation function,
and f'(\cdot) represents its derivative. The activation function decides whether the neuron will fire
or not, in response to a given input activation. Note that the nonlinear activation functions are
differentiable so that the parameters of the network can be tuned using error back-propagation.

One popular activation function is the sigmoid function, given as follows:
p_i = f(a_i) = \frac{1}{1 + \exp(-a_i)}.   (3.7)

We will discuss other activation functions in detail in Section 4.2.4. The derivative of the sigmoid
activation function is particularly convenient because it can be written in terms of the sigmoid function
itself (i.e., p_i) and is given by:

f_i'(a_i) = p_i (1 - p_i).   (3.8)

Therefore, we can write the gradient equation for the output layer neurons as follows:
\frac{\partial E}{\partial \theta_{ij}^L} = (y_i - p_i)(1 - p_i)\, x_j\, p_i.   (3.9)

Similarly, we can calculate the error signal for the intermediate hidden layers in a multi-layered
neural network architecture by back propagation of errors as follows:
\delta_i^l = f'(a_i^l) \sum_j \theta_{ij}^{l+1} \delta_j^{l+1},   (3.10)

where l \in \{1, \ldots, L-1\} and L denotes the total number of layers in the network. The above
equation applies the chain rule to progressively calculate the gradients of the internal parameters
using the gradients of all subsequent layers. The overall update equation for the MLP parameters
\theta_{ij} can be written as:

\theta_{ij}^{t+1} = \theta_{ij}^t + \eta\, \delta_i^l\, x_j^{l-1},   (3.11)

where x_j^{l-1} is the output from the previous layer and t denotes the previous training iteration.
The complete learning process usually involves a number of iterations, and the parameters are
continually updated until the network is optimized (i.e., after a set number of iterations
or when \theta_{ij}^{t+1} no longer changes).
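
A compact sketch of the generalized delta rule (Eqs. (3.5)–(3.11)) for a one-hidden-layer MLP with sigmoid activations is given below; the layer sizes, toy data, and step size are assumptions made only for illustration (biases are omitted, as in the delta rule above).

# Compact sketch of back-propagation for a one-hidden-layer MLP with sigmoids.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                        # inputs
Y = (X[:, :1] * X[:, 1:2] > 0).astype(float)         # toy binary targets
W1 = rng.normal(scale=0.5, size=(4, 8))              # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(8, 1))              # hidden -> output weights
eta = 0.5

for _ in range(2000):
    H = sigmoid(X @ W1)                              # hidden activations
    P = sigmoid(H @ W2)                              # predictions p_i = f(a_i)
    delta_out = (Y - P) * P * (1 - P)                # Eq. (3.6): (y - p) f'(a)
    delta_hid = (delta_out @ W2.T) * H * (1 - H)     # Eq. (3.10): back-propagated error
    W2 += eta * H.T @ delta_out / len(X)             # Eq. (3.11) update
    W1 += eta * X.T @ delta_hid / len(X)
print(np.mean((P > 0.5) == Y))                       # training accuracy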
Gradient Instability Problem: The generalized delta rule successfully works
for the case of shallow networks (ones with one or two hidden layers). However, when the networks are deep (i.e., L is large), the learning process can
suffer from the vanishing or exploding gradient problems depending on the
choice of the activation function (e.g., sigmoid in above example). This instability relates particularly to the initial layers in a deep network. As a result,
the weights of the initial layers cannot be properly tuned. We explain this
with an example below.

Consider a deep network with many layers. The outputs of each weight
layer are squashed within a small range using an activation function (e.g.,
[0,1] for the case of the sigmoid). The gradient of the sigmoid function
leads to even smaller values (see Fig. 3.2). To update the initial layer parameters, the derivatives are successively multiplied according to the chain
rule (as in Eq. (3.10)). These multiplications exponentially decay the back-propagated signal.
If we consider a network depth of 5 and the maximum possible gradient value
for the sigmoid (i.e., 0.25), the decaying factor would be (0.25)^5 \approx 0.0009.
This is called the “vanishing gradient” problem. It is easy
to follow that in cases where the gradient of the activation function is large,
successive multiplications can lead to the “exploding gradient” problem.
We will introduce the ReLU activation function in Chapter 4, whose
gradient is equal to 1 (when a unit is “on”). Since 1^L = 1, this avoids both
the vanishing and the exploding gradient problems.

Figure 3.2: The sigmoid activation function and its derivative. Note that the range of values for
the derivative is relatively small which leads to the vanishing gradient problem.

3.3 RECURRENT NEURAL NETWORKS

The feed-back networks contain loops in their network architecture, which allows them to process sequential data. In many applications, such as caption generation for an image, we want to
make a prediction such that it is consistent with the previously generated outputs (e.g., already
generated words in the caption). To accomplish this, the network processes each element in an
input sequence in a similar fashion (while considering the previous computational state). For
this reason it is also called an RNN.

Since RNNs process information in a manner that is dependent on the previous computational
states, they provide a mechanism to “remember” previous states. The memory mechanism
is usually effective at remembering only the short-term information that was previously processed by
the network. Below, we outline the architectural details of an RNN.

3.3.1 ARCHITECTURE BASICS
A simple RNN architecture is shown in Fig. 3.3. As described above, it contains a feed-back
loop whose working can be visualized by unfolding the recurrent network over time (shown on
the right). The unfolded version of the RNN is very similar to a feed-forward neural network
described in Section 3.2. We can, therefore, understand RNN as a simple multi-layered neural
network, where the information flow happens over time and different layers represent the computational output at different time instances. The RNN operates on sequences and therefore the
input and consequently the output at each time instance also varies.

Figure 3.3: The RNN Architecture. Left: A simple recurrent network with a feed-back loop.
Right: An unfolded recurrent architecture at different time-steps.
We highlight the key features of an RNN architecture below.
Variable Length Input: RNN can operate on inputs of variable length, e.g., videos with
variable frame length, sentences with different number of words, 3D point clouds with variable
number of points. The length of the unfolded RNN structure depends on the length of the
input sequence, e.g., for a sentence consisting of 12 words, there will be a total of 12 layers in
the unfolded RNN architecture. In Fig. 3.3, the input to the network at each time instance t is
represented by the variable x_t.
Hidden State: The RNN internally holds the memory of the previous computation in a
hidden state represented by h_t. This state can be understood as an input from the previous layer
in the unfolded RNN structure. At the beginning of the sequence processing, it is initialized with
a zero or a random vector. At each time step, this state is updated by considering its previous
value and the current input to the network:

h_t = f(A x_t + B h_{t-1}),   (3.12)

where f(\cdot) is the nonlinear activation function. The weight matrix B is called the transition
matrix since it influences how the hidden state changes over time.
Variable Length Output: The output of the RNN at each time step is denoted by y_t. The
RNNs are capable of generating variable length outputs, e.g., translating a sentence in one language to another language where the output sequence lengths can be different from the input
sequence length. This is possible because RNNs consider the hidden state while making predictions. The hidden state models the joint probability of the previously processed sequence which
can be used to predict new outputs. As an example, given a few starting words in a sentence,
the RNN can predict the next possible word in the sentence, where a special end of a sentence
symbol is used to denote the end of each sentence. In this case all possible words (including the
end of sentence symbol) are included in the dictionary over which the prediction is made:
y_t = f(C h_t),   (3.13)

where f(\cdot) is an activation function such as a softmax (Section 4.2.4).
Shared Parameters: The parameters in the unfolded RNN linking the input, the hidden
state and the output (denoted by A; B , and C , respectively) are shared between all layers. This
is the reason why the complete architecture can be represented by using a loop to represent its
recurrent architecture. Since the parameters in a RNN are shared, the total number of tunable
parameters are considerably less than an MLP, where a separate set of parameters are learned for
each layer in the network. This enables efficient training and testing of the feed-back networks.
Based on the above description of RNN architecture, it can be noted that indeed the hidden state of the network provides a memory mechanism, but it is not effective when we want
to remember long-term relationships in the sequences. Therefore, RNN only provide shortterm memory and find difficulties in “remembering” (a few time-steps away) old information
processed through it. To overcome this limitation, improved versions of recurrent networks
have been introduced in the literature which include the LSTM [Hochreiter and Schmidhuber, 1997], Gated Recurrent Unit (GRU) [Cho et al., 2014], Bi-directional RNN (B-RNN)
[Graves and Schmidhuber, 2005] and Neural Turing Machines (NTM) [Graves et al., 2014].
However, the details of all these network architectures and their functioning is out of the scope
of this book which is focused on feed-forward architectures (particularly CNNs).

3.3.2 PARAMETER LEARNING
The parameters in a feed-back network can be learned using the generalized delta rule (back-propagation
algorithm), similar to feed-forward networks. However, instead of error back-propagation through
network layers as in feed-forward networks, back-propagation is performed through time in the
feed-back networks. At each time instance, the output of the RNN is computed as a function of its
previous and current inputs. The Back-Propagation Through Time (BPTT) algorithm struggles to
learn long-term relationships in sequences because of the difficulty of error computation over long
sequences. Specifically, when the number of iterations increases, the BPTT algorithm suffers from
the vanishing or the exploding gradient problem. One way around this problem is to compute the
error signal over a truncated unfolded RNN. This reduces the cost of the parameter update process
for long sequences, but limits the dependence of the output at each time instance to only a few
previous hidden states.

3.4 LINK WITH BIOLOGICAL VISION

We think it is important to briefly discuss biological neural networks (BNNs) and their operational mechanisms in order to study their similarities and dissimilarities with artificial neural
networks. As a matter of fact, artificial neural networks do not resemble their biological counterparts in terms of functioning and scale; however, they are indeed motivated by BNNs, and
several of the terms used to describe artificial neural networks are borrowed from the neuroscience literature. Therefore, we introduce neural networks in the brain, draw parallels between
artificial and biological neurons, and provide a model of the artificial neuron based on biological
vision.

3.4.1 BIOLOGICAL NEURON
The human brain contains approximately 100 billion neurons. To interpret this number, let us
assume we have 100 billion one dollar bills, where each bill is only 0.11 mm thick. If we stack
all these one dollar bills on top of each other, the resulting stack will be 10922.0 km high. This
illustrates the scale and magnitude of the human brain.
A biological neuron is a nerve cell which processes information [Jain et al., 1996]. Each
neuron is surrounded by a membrane and has a nucleus which contains genes. It has specialized
projections which manage the input and output to the nerve cell. These projections are termed
dendrites and axons. We describe these and other key aspects of the biological neuron below.
Dendrites: Dendrites are fibers which act as receptive lines and bring information (activations) to the cell body from other neurons. They are the inputs of the neuron.
Axons: Axons are fibers which act as transmission lines and take information away from
the cell body to other neurons. They act as outputs of the neuron.
Cell body: The cell body (also called the soma) receives the incoming information through
dendrites, processes it, and sends it to other neurons via axons.
Synapses: The specialized connections between axons and dendrites which allow the communication of signals are called synapses. The communication takes place by an electro-chemical
process where the neurotransmitters (chemicals) are released at the synapse and are diffused
across the synaptic gap to transmit information. There is a total of approximately 1 quadrillion
(10^15) synapses in the human brain [Changeux and Ricoeur, 2002].
Connections: Neurons are densely inter-connected with each other. On average, each
neuron receives inputs from approximately 10^5 synapses.
Neuron Firing: A neuron receives signals from connected neurons via dendrites. The cell
body sums up the received signals and the neuron fires if the combined input signal exceeds a
threshold. By neuron firing, we mean that it generates an output which is sent out through axons.
If the combined input is below the threshold, no response signal is generated by the neuron (i.e.,
the neuron does not fire). The thresholding function which decides whether a neuron fires or
not is called the activation function.
Next, we describe a simple computational model which mimics the working of a biological
neuron.

Figure 3.4: A biological neuron (left) and a computational model (right) which is used to develop
artificial neural networks.

3.4.2 COMPUTATIONAL MODEL OF A NEURON
A simple mathematical model of the biological neuron known as the Threshold Logic Unit (TLU)
was proposed by McCulloch and Pitts [1943]. It consists of a set of incoming connections which
feed the unit with activations coming from other neurons. These inputs are weighted using a set
of weights denoted by {w}. The processing unit then sums all the inputs and applies a nonlinear
threshold function (also known as the activation function) to calculate the output. The resulting
output is then transmitted to other connected neural units. We can denote the operation of a
McCulloch-Pitts neuron as follows:
$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right), \qquad (3.14)$$


where b is a threshold, w_i denotes the synapse weight, x_i is the input to the neuron, and f(·) represents a nonlinear activation function. For the simplest case, f is a step function which gives 0 (i.e., the neuron does not fire) when the input is below 0 (i.e., \sum_{i=1}^{n} w_i x_i + b is less than the firing threshold) and 1 when it is greater than 0. For other cases, the activation function can be a sigmoid, tanh, or a ReLU for a smooth thresholding operation (see Chapter 4).
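To make this model concrete, the following minimal NumPy sketch implements the thresholded neuron of Eq. (3.14); the weights, bias, and inputs are illustrative values chosen for this example, not taken from the text.

```python
import numpy as np

def tlu_neuron(x, w, b, activation="step"):
    """Threshold Logic Unit: y = f(sum_i w_i * x_i + b), as in Eq. (3.14)."""
    z = np.dot(w, x) + b
    if activation == "step":            # fires (1) only above the threshold
        return 1.0 if z > 0 else 0.0
    elif activation == "sigmoid":       # smooth alternative to the hard step
        return 1.0 / (1.0 + np.exp(-z))
    raise ValueError("unknown activation")

# Illustrative values: three inputs with hand-picked weights and bias.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.4, 0.3])
b = -0.5
print(tlu_neuron(x, w, b))              # 1.0, since 0.4 - 0.4 + 0.6 - 0.5 = 0.1 > 0
```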
The McCulloch-Pitts neuron is a very simple computational model. However, networks of such neurons have been shown to approximate complex functions quite well. McCulloch and Pitts showed that a network comprising such neurons can perform universal computations. The universal computational ability of neural networks ensures their ability to model a very rich set of continuous functions using only a finite number of neurons. This fact is formally known as the “Universal Approximation Theorem” for neural networks. In contrast to the McCulloch-Pitts model, state-of-the-art neuron models also incorporate additional features such as stochastic behaviors and non-binary inputs and outputs.

3.4.3 ARTIFICIAL VS. BIOLOGICAL NEURON
Having outlined the basics of artificial and biological neuron operation, we can now draw parallels between their functioning and identify the key differences between the two.
An artificial neuron (also called a unit or a node) takes several input connections (the dendrites of a biological neuron) which are assigned certain weights (analogous to synapses). The unit then computes the sum of the weighted inputs and applies an activation function (analogous to the cell body of a biological neuron). The result of the unit is then passed on via the output connection (analogous to the axon).
Note that the above-mentioned analogy between a biological and an artificial neuron is only valid in a loose sense. In reality, there exist a number of crucial differences in the functioning of biological neurons. As an example, biological neurons do not simply sum the weighted inputs; rather, the dendrites interact in a much more complex way to combine the incoming information. Furthermore, biological neurons communicate asynchronously, unlike their computational counterparts which operate synchronously. The training mechanisms in the two types of neural networks are also different, and the training mechanisms in biological networks are not precisely known. The topology of biological networks is also far more complicated than that of artificial networks, which currently have either a feed-forward or a feed-back architecture.

CHAPTER 4

Convolutional Neural Network

4.1 INTRODUCTION

We discussed neural networks in Chapter 3. CNNs are one of the most popular categories of neural networks, especially for high-dimensional data (e.g., images and videos). CNNs operate in a way that is very similar to standard neural networks. A key difference, however, is that each unit in a CNN layer is a two- (or higher-) dimensional filter which is convolved with the input of that layer. This is essential for cases where we want to learn patterns from high-dimensional input media, e.g., images or videos. CNN filters incorporate spatial context by having a similar (but smaller) spatial shape as the input media, and use parameter sharing to significantly reduce the number of learn-able variables. We will describe these concepts in detail in Chapters 4, 5, and 6. However, we find it important to first give a brief historical background of CNNs.
One of the earliest forms of CNN was the Neocognitron model proposed by Kunihiko Fukushima [Fukushima and Miyake, 1982]. It consisted of multiple layers which automatically learned a hierarchy of feature abstractions for pattern recognition. The Neocognitron was motivated by the seminal work of Hubel and Wiesel [1959] on the primary visual cortex, which demonstrated that the neurons in the brain are organized in the form of layers. These layers learn to recognize visual patterns by first extracting local features and subsequently combining them to obtain higher-level representations. The network training was performed using a reinforcement learning rule. A major improvement over the Neocognitron was the LeNet model proposed by LeCun et al. [1989], where the model parameters were learned using error back-propagation. This CNN model was successfully applied to recognize handwritten digits.
CNNs are a useful class of models for both supervised and unsupervised learning paradigms. The supervised learning mechanism is the one where the input to the system and the desired outputs (true labels) are known, and the model learns a mapping between the two. In the unsupervised learning mechanism, the true labels for a given set of inputs are not known, and the model aims to estimate the underlying distribution of the input data samples. An example of a supervised learning task (image classification) is shown in Fig. 4.1. The CNN learns to map a given image to its corresponding category by detecting a number of abstract feature representations, ranging from simple to more complex ones. These discriminative features are then used within the network to predict the correct category of an input image. The neural network classifier is identical to the MLP we studied in Chapter 3. Recall that we reviewed popular hand-crafted feature representations and machine learning classifiers in Chapter 2. The function of a CNN is similar to this pipeline, with the key difference being the automatic learning of a


hierarchy of useful feature representations and its integration of the classification and feature
extraction stages in a single pipeline which is trainable in an end-to-end manner. This reduces
the need for manual design and expert human intervention.
Figure 4.1: A CNN learns low-level features in the initial layers, followed by more complex intermediate and high-level feature representations, which are used for a classification task. The feature visualizations are adapted from Zeiler and Fergus [2014].

4.2 NETWORK LAYERS

A CNN is composed of several basic building blocks, called CNN layers. In this section, we study these building blocks and their function in the CNN architecture. Note that some of these layers implement basic functionalities such as normalization, pooling, convolution, and full connectivity. These basic layers are covered first in this section to develop a basic understanding of CNN layers. Along with such basic but fundamental building blocks, we also introduce several more complex layers later in this section which are composed of multiple building blocks (e.g., the spatial transformer layer and the VLAD pooling layer).

4.2.1 PRE-PROCESSING
Before passing input data to the network, the data needs to be pre-processed. The general pre-processing steps that are commonly used consist of the following.
• Mean-subtraction: The input patches (belonging to both train and test sets) are zero-centered by subtracting the mean computed on the entire training set. Given N training images, each denoted by x ∈ R^{h×w×c}, we can denote the mean-subtraction step as follows:

$$x' = x - \hat{x}, \quad \text{where } \hat{x} = \frac{1}{N}\sum_{i=1}^{N} x_i. \qquad (4.1)$$
• Normalization: The input data (belonging to both train and test sets) is divided by the standard deviation of each input dimension (pixels in the case of an image), calculated on the training set, to normalize the standard deviation to a unit value. It can be represented as follows:

$$x'' = \frac{x'}{\sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \hat{x})^2}}. \qquad (4.2)$$

• PCA Whitening: The aim of PCA whitening is to reduce the correlations between different data dimensions by independently normalizing them. This approach starts with the zero-centered data and calculates the covariance matrix, which encodes the correlation between the data dimensions. This covariance matrix is then decomposed via the Singular Value Decomposition (SVD) algorithm, and the data is decorrelated by projecting it onto the eigenvectors found via SVD. Afterward, each dimension is divided by the square root of its corresponding eigenvalue to normalize all the respective dimensions in the data space.
• Local Contrast Normalization: This normalization scheme gets its motivation from neuroscience. As the name suggests, this approach normalizes the local contrast of the feature maps to obtain more prominent features. It first generates a local neighborhood for each pixel, e.g., for a unit radius, the eight neighboring pixels are selected. Afterward, the pixel is zero-centered using the mean calculated over its own and its neighboring pixel values. Similarly, the pixel is also normalized by the standard deviation of its own and its neighboring pixel values (but only if the standard deviation is greater than one). The resulting pixel value is used for further computations.
Another similar approach is local response normalization [Krizhevsky et al., 2012], which normalizes the contrast of features obtained from adjacent filters in a convolution layer.
Note that PCA whitening can amplify the noise in the data; therefore, recent CNN models just use a simple mean-subtraction (and optionally a normalization step) for pre-processing. The scaling and shifting achieved through mean-subtraction and normalization are helpful for gradient-based learning, because comparable updates are then made to the network weights for all input dimensions, which enables a stable learning process. Furthermore, local contrast normalization (LCN) and local response normalization (LRN) are not common in recent architectures, since other approaches (e.g., batch normalization, which we will describe in Section 5.2.4) have proven to be more effective.
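As a rough illustration of the mean-subtraction and normalization steps in Eqs. (4.1) and (4.2), the short NumPy sketch below computes the statistics on a (toy, randomly generated) training set and applies them to both the training and the test data; the array shapes and variable names are our own.

```python
import numpy as np

# Assumed shapes: N training images of size h x w x c (toy stand-ins for real data).
train = np.random.rand(100, 32, 32, 3).astype(np.float32)
test  = np.random.rand(20, 32, 32, 3).astype(np.float32)

mean = train.mean(axis=0)                 # per-dimension mean over the training set (Eq. 4.1)
std  = train.std(axis=0, ddof=1) + 1e-8   # per-dimension std; epsilon avoids division by zero

train_pp = (train - mean) / std           # zero-centered, unit-variance training data (Eq. 4.2)
test_pp  = (test - mean) / std            # test data uses the *training* statistics
```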

4.2.2 CONVOLUTIONAL LAYERS
A convolutional layer is the most important component of a CNN. It comprises a set of filters
(also called convolutional kernels) which are convolved with a given input to generate an output
feature map.


What is a Filter? Each filter in a convolutional layer is a grid of discrete numbers. As an example, consider the 2 × 2 filter shown in Fig. 4.2. The weights of each filter (the numbers in the grid) are learned during the training of the CNN. This learning procedure involves a random initialization of the filter weights at the start of the training (different approaches for weight initialization will be discussed in Section 5.1). Afterward, given input-output pairs, the filter weights are tuned over a number of iterations during the learning procedure. We will cover network training in more detail in Chapter 5.
Figure 4.2: An example of a 2D image filter.
What is a Convolution Operation? We mentioned earlier that the convolution layer performs a convolution between the filters and the input to the layer. Let us consider the 2D convolution in Fig. 4.3 to develop an insight into the layer's operation. Given a 2D input feature map and a convolution filter of sizes 4 × 4 and 2 × 2, respectively, a convolution layer multiplies the 2 × 2 filter with a highlighted patch (also 2 × 2) of the input feature map and sums up all the values to generate one value in the output feature map. The filter slides along the width and the height of the input feature map, and this process continues until the filter can no longer slide further.
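The following minimal NumPy sketch implements this sliding-window operation (i.e., the correlation that deep learning libraries call “convolution”; see the remark below); the 4 × 4 input and 2 × 2 filter values are illustrative and do not reproduce the exact numbers of Fig. 4.3.

```python
import numpy as np

def correlate2d(x, k, stride=1):
    """Slide filter k over input x and sum the element-wise products (no flipping)."""
    f = k.shape[0]
    out_h = (x.shape[0] - f) // stride + 1
    out_w = (x.shape[1] - f) // stride + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride+f, j*stride:j*stride+f]
            y[i, j] = np.sum(patch * k)
    return y

x = np.arange(16, dtype=float).reshape(4, 4)   # illustrative 4 x 4 input feature map
k = np.array([[2., 0.], [-1., 3.]])            # illustrative 2 x 2 filter
print(correlate2d(x, k))                       # 3 x 3 output, as in Fig. 4.3
# For "convolution" in the signal-processing sense, flip the filter first:
# correlate2d(x, np.flip(k))
```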
Remark: In the signal processing literature, there is a distinction between the terms “convolution” and “cross-correlation.” The operation we described above is the “correlation operation.” During convolution, the only difference is that the filter is flipped along its height and width before the multiplication and summation (see Fig. 4.4).
In machine learning, both operations are equivalent and there is rarely a distinction made between the two. Both terms are used interchangeably, and most of the deep learning libraries implement the correlation operation in their convolution layers. The reason is that the network optimization will converge on the right filter weights for either of the two operations. If the weights of a convolutional network are replaced with the ones learned using a correlation network, the network performance will remain the same, because only the orientation of the filter differs between the two networks, and their discriminative ability stays the same. In this book, we follow the machine learning convention and do not make a distinction between the two operations, i.e., a convolution layer performs the correlation operation in our case.


Figure 4.3: The operation of a convolution layer is illustrated in the figure above. (a)–(i) show the computations performed at each step, as the filter is slid over the input feature map to compute the corresponding value in the output feature map. The 2 × 2 filter (shown in green) is multiplied with the same-sized region (shown in orange) within a 4 × 4 input feature map and the resulting values are summed up to obtain a corresponding entry (shown in blue) in the output feature map at each convolution step.


Figure 4.4: The distinction between the correlation and convolution operations in the signal
processing literature. In machine learning, this distinction is usually not important and the deep
learning literature normally refers to the layers implementing the correlation operation as the
convolution operation. In this book, we also follow the same naming convention that is adopted
in the machine learning literature.


In the above example, in order to calculate each value of the output feature map, the filter takes a step of 1 along the horizontal or vertical direction (i.e., along the columns or the rows of the input). This step size is termed the stride of the convolution filter, and it can be set to a value other than 1 if required. For example, the convolution operation with a stride of 2 is shown in Fig. 4.5. Compared to the stride of 1 in the previous example, the stride of 2 results in a smaller output feature map. This reduction in dimensions is referred to as the sub-sampling operation. Such a reduction in dimensions provides moderate invariance to the scale and pose of objects, which is a useful property in applications such as object recognition. We will discuss other sub-sampling mechanisms in the section on pooling layers (Section 4.2.3).
We saw in Fig. 4.3 that the spatial size of the output feature map is reduced compared to the input feature map. Precisely, for a filter of size f × f, an input feature map of size


Figure 4.5: The operation of a convolution layer with a zero-padding of 1 and a stride of 2 is illustrated in the figure above. (a)–(i) show the computations that are performed at each step, as the filter is slid over the input feature map to compute the corresponding value of the output feature map. The 2 × 2 filter (shown in green) is multiplied with the same-sized region (shown in orange) within a 6 × 6 input feature map (including zero-padding) and the resulting values are summed up to obtain a corresponding entry (shown in blue) in the output feature map at each convolution step.


h × w, and a stride length s, the output feature dimensions are given by:

$$h' = \left\lfloor \frac{h - f + s}{s} \right\rfloor, \qquad w' = \left\lfloor \frac{w - f + s}{s} \right\rfloor, \qquad (4.3)$$

where ⌊·⌋ denotes the floor operation. However, in some applications, such as image de-noising, super-resolution, or segmentation, we want to keep the spatial size constant (or even larger) after convolution. This is important because these applications require denser predictions at the pixel level. Moreover, it allows us to design deeper networks (i.e., with more weight layers) by avoiding a quick collapse of the output feature dimensions. This helps in achieving better performance and higher-resolution output labelings. This can be achieved by applying zero-padding around the input feature map. As shown in Fig. 4.5, zero-padding the horizontal and vertical dimensions allows us to increase the output dimensions and therefore gives more flexibility in the architecture design. The basic idea is to increase the size of the input feature map such that an output feature map with the desired dimensions is obtained. If p denotes the increase in the input feature map along each dimension (by padding zeros), we can represent the modified output
feature map dimensions as follows:




$$h' = \left\lfloor \frac{h - f + s + p}{s} \right\rfloor, \qquad w' = \left\lfloor \frac{w - f + s + p}{s} \right\rfloor. \qquad (4.4)$$
In Fig. 4.5, p = 2 and therefore the output dimensions have been increased from 2 × 2 (the size without zero-padding) to 3 × 3. If the convolutional layers did not zero-pad the inputs and applied only valid convolutions, the spatial size of the output features would be reduced by a small factor after each convolution layer and the information at the borders would be “washed away” too quickly.
Padded convolutions are usually categorized into three types based on the amount of zero-padding involved (a small sketch that computes the corresponding output sizes follows this list).
• Valid Convolution is the simplest case where no zero-padding is involved. The filter always stays within “valid” positions (i.e., no zero-padded values) in the input feature map, and the output size is reduced by f − 1 along both the height and the width.
• Same Convolution ensures that the output and input feature maps have equal (the “same”) sizes. To achieve this, the input is zero-padded appropriately. For example, for a stride of 1, the padding is given by p = ⌊f/2⌋ on each side. This is why it is also called “half” convolution.
• Full Convolution applies the maximum possible padding to the input feature maps before convolution. The maximum possible padding is the one where at least one valid input value is involved in all convolution cases. Therefore, it is equivalent to padding f − 1 zeros on each side for a filter of size f, so that at the extreme corners at least one valid value will be included in the convolutions.
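The small sketch below, written under the conventions of Eq. (4.4) (where p is the total padding added along a dimension), computes the output size for the examples of Figs. 4.3 and 4.5 and for the three padding types; the helper names are ours.

```python
import math

def conv_output_size(h, f, s=1, p=0):
    """Output height per Eq. (4.4); p is the *total* zero-padding added along the dimension."""
    return math.floor((h - f + s + p) / s)

def padding_amount(f, mode="valid"):
    """Total padding along one dimension for the three conventions (stride 1)."""
    if mode == "valid":
        return 0                      # no padding, output shrinks by f - 1
    if mode == "same":
        return 2 * (f // 2)           # floor(f/2) zeros on each side keeps the size (odd f)
    if mode == "full":
        return 2 * (f - 1)            # f - 1 zeros on each side, output grows by f - 1
    raise ValueError(mode)

print(conv_output_size(4, 2, s=1, p=0))   # Fig. 4.3: h = 4, f = 2, s = 1, no padding -> 3
print(conv_output_size(4, 2, s=2, p=2))   # Fig. 4.5: h = 4, f = 2, s = 2, total p = 2 -> 3
print(conv_output_size(5, 3, s=1, p=padding_amount(3, "same")))   # 5 (same convolution)
```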


Receptive Field: You may have noticed above that we used a relatively small-sized kernel with respect to the input. In computer vision, the inputs are of very high dimensions (e.g., images and videos) and need to be processed efficiently through large-scale CNN models. Therefore, instead of defining convolutional filters that are equal to the spatial size of the inputs, we define them to be of a significantly smaller size compared to the input images (e.g., in practice 3 × 3, 5 × 5, and 7 × 7 filters are used to process images with sizes such as 110 × 110, 224 × 224, and even larger). This design provides two key benefits: (a) the number of learn-able parameters is greatly reduced when smaller-sized kernels are used; and (b) small-sized filters ensure that distinctive patterns are learned from local regions corresponding to, e.g., different object parts in an image. The size (height and width) of the filter, which defines the spatial extent of the region that a filter can modify at each convolution step, is called the “receptive field” of the filter. Note that the receptive field specifically relates to the spatial dimensions of the input image/features. When we stack many convolution layers on top of each other, the “effective receptive field” of each layer (relative to the input of the network) becomes a function of the receptive fields of all the previous convolution layers. The effective receptive field for a stack of N convolution layers, each with a kernel size of f, is given as:
$$\text{RF}^{\text{eff}}_n = f + (n - 1)(f - 1), \qquad n \in [1, N]. \qquad (4.5)$$

As an example, if we stack two convolution layers with kernel sizes of 5 × 5 and 3 × 3, respectively, the receptive field of the second layer would be 3 × 3, but its effective receptive field with respect to the input image would be 7 × 7. When the stride and filter sizes of the stacked convolution layers are different, the effective receptive field of each layer can be represented in a more general form as follows:
as follows:
RFneff

D

RFneff 1


C .fn

1/ 

n
Y1


si ;

(4.6)

i D1

where fn denotes the filter size in the nth layer, si represents the stride length for each previous
layer and RFneff 1 represents the effective receptive field of the previous layer.
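As an illustration, the following sketch applies the recursion of Eq. (4.6) layer by layer; the function name and inputs are ours, and the first example reproduces the 5 × 5 followed by 3 × 3 case discussed above.

```python
def effective_receptive_field(filter_sizes, strides):
    """Recursively apply Eq. (4.6): RF_n = RF_{n-1} + (f_n - 1) * prod(s_1 .. s_{n-1})."""
    rf = filter_sizes[0]                     # the first layer sees exactly its own filter
    jump = 1                                 # cumulative product of the previous strides
    for f, s_prev in zip(filter_sizes[1:], strides[:-1]):
        jump *= s_prev
        rf += (f - 1) * jump
    return rf

# Example from the text: a 5x5 layer followed by a 3x3 layer (stride 1) -> 7x7.
print(effective_receptive_field([5, 3], [1, 1]))          # 7
# Three 3x3 layers with stride 1 grow the receptive field linearly: 3, 5, 7.
print(effective_receptive_field([3, 3, 3], [1, 1, 1]))    # 7
```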
Extending the Receptive Field: In order to enable very deep models with a relatively small number of parameters, a successful strategy is to stack many convolution layers with small receptive fields (e.g., 3 × 3 in the VGGnet [Simonyan and Zisserman, 2014b] in Chapter 6). However, this limits the spatial context of the learned convolutional filters, which only scales linearly with the number of layers. In applications such as segmentation and labeling, which require pixel-wise dense predictions, a desirable characteristic is to aggregate broader contextual information using bigger receptive fields in the convolution layer. Dilated convolution (or atrous convolution [Chen et al., 2014]) is an approach which extends the receptive field size without increasing the number of parameters [Yu and Koltun, 2015]. The central idea is that a new dilation parameter (d) is introduced, which decides on the spacing between the filter


Figure 4.6: Convolution with a dilated filter where the dilation factor is d = 2.
weights while performing the convolution. As shown in Fig. 4.6, a dilation by a factor of d means that the original filter is expanded by d − 1 spaces between each element, and the intermediate empty locations are filled in with zeros. As a result, a filter of size f × f is enlarged to a size of f + (d − 1)(f − 1). The output dimensions corresponding to a convolution operation with a pre-defined filter size (f), zero-padding (p), stride (s), dilation factor (d), and an input with height (h) and width (w) are given as:
$$h' = \frac{h - f - (d-1)(f-1) + s + 2p}{s}, \qquad (4.7)$$

$$w' = \frac{w - f - (d-1)(f-1) + s + 2p}{s}. \qquad (4.8)$$

The effective receptive field of the n-th layer can be expressed as:

$$\text{RF}^{\text{eff}}_n = \text{RF}^{\text{eff}}_{n-1} + d(f - 1), \qquad \text{s.t. } \text{RF}^{\text{eff}}_1 = f. \qquad (4.9)$$
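The short sketch below applies Eq. (4.9) and reproduces the 3 → 7 → 13 growth of the effective receptive field shown in Fig. 4.7 for a 3 × 3 filter with dilation factors 1, 2, and 3 (a stride of 1 is assumed, as in the figure).

```python
def dilated_receptive_fields(f, dilations):
    """Apply Eq. (4.9): RF_1 = f, then RF_n = RF_{n-1} + d_n * (f - 1)."""
    rfs = [f]                                     # first layer: RF equals the filter size
    for d in dilations[1:]:
        rfs.append(rfs[-1] + d * (f - 1))
    return rfs

print(dilated_receptive_fields(3, [1, 2, 3]))     # [3, 7, 13], matching Fig. 4.7
```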

The effect of the dilation operation can easily be understood by looking at the dilated filter for different values of the parameter d. In Fig. 4.7, a stack of three convolution layers, each with a different dilation parameter, is shown. In the first layer, d = 1 and the dilated convolution is equivalent to a standard convolution. The receptive field size in this case is equal to the filter size, i.e., 3. In the second convolution layer, where d = 2, the elements of the kernel are spread out such that there is one space between each pair of elements. This inflation of the convolution filter rapidly increases the effective receptive field size to 7 × 7. Similarly, in the third layer, d = 3, which increases the effective receptive field size to 13 × 13 according to the relationship in Eq. (4.9). This effectively allows us to incorporate a wider context while performing convolutions.


Figure 4.7: This figure shows three convolution layers with a filter of size 3 × 3. The first, second, and third layers have dilation factors of one, two, and three, respectively (from left to right). The effective receptive field with respect to the input image is shown in orange at each convolution layer. Note that an effective receptive field corresponds to the size of the region in the input image which affects each output activation in a convolution layer. At the first layer, each output activation is affected by a 3 × 3 region in the input image because the filter size is 3 × 3. In the subsequent layers, the receptive field increases due to the increasing network depth and the increasing dilation factor, both of which contribute to the incorporation of a wider context in the output feature responses.
Combining multiple levels of image context has been shown to improve the performance of classification, detection, and segmentation using deep CNNs (Chapter 7).
Hyper-parameters: The parameters of the convolution layer which need to be set by the
user (based on cross-validation or experience) prior to the filter learning (such as the stride and
padding) are called hyper-parameters. These hyper-parameters can be interpreted as the design
choices of our network architecture based on a given application.
High Dimensional Cases: The 2D case is the simplest one, where the filter has only a
single channel (represented as a matrix) which is convolved with the input feature channels to
produce an output response. For higher-dimensional cases, e.g., when the inputs to the CNN layers are tensors (e.g., 3D volumes in the case of volumetric representations), the filters are also
3D cubes which are convolved along the height, width, and depth of the input feature maps to
generate a corresponding 3D output feature map. However, all the concepts that we discussed
above for the 2D case still remain applicable to the processing of 3D and higher dimensional
inputs (such as 3D spatio-temporal representation learning). The only difference is that the
convolution operation is extended to an extra dimension, e.g., for the case of 3D, in addition to a convolution along the height and the width as in the 2D case, convolutions are performed along the depth
as well. Similarly, zero-padding and striding can be performed along the depth for the 3D case.

4.2.3 POOLING LAYERS
A pooling layer operates on blocks of the input feature map and combines the feature activations.
This combination operation is defined by a pooling function such as the average or the max
function. Similar to the convolution layer, we need to specify the size of the pooled region and
the stride. Figure 4.8 shows the max pooling operation, where the maximum activation is chosen
from the selected block of values. This window is slid across the input feature maps with a step size defined by the stride (1 in the case of Fig. 4.8). If the size of the pooled region is given by f × f, with a stride s, the size of the output feature map is given by:
$$h' = \left\lfloor \frac{h - f + s}{s} \right\rfloor, \qquad w' = \left\lfloor \frac{w - f + s}{s} \right\rfloor. \qquad (4.10)$$

Figure 4.8: The operation of a max-pooling layer when the size of the pooling region is 2 × 2 and the stride is 1. (a)–(i) show the computations performed at each step as the pooled region in the input feature map (shown in orange) is slid at each step to compute the corresponding value in the output feature map (shown in blue).


The pooling operation effectively down-samples the input feature map. Such a down-sampling process is useful for obtaining a compact feature representation which is invariant to moderate changes in object scale, pose, and translation in an image [Goodfellow et al., 2016].
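A minimal NumPy sketch of max pooling is given below; the output size follows Eq. (4.10), and the input values are illustrative rather than those of Fig. 4.8.

```python
import numpy as np

def max_pool2d(x, f=2, s=1):
    """Max pooling: output size follows Eq. (4.10), h' = floor((h - f + s) / s)."""
    out_h = (x.shape[0] - f) // s + 1
    out_w = (x.shape[1] - f) // s + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y[i, j] = x[i*s:i*s+f, j*s:j*s+f].max()   # keep the largest activation
    return y

x = np.array([[1., 3., 2., 4.],
              [5., 6., 1., 0.],
              [1., 2., 8., 7.],
              [3., 4., 9., 6.]])
print(max_pool2d(x, f=2, s=1))   # 3 x 3 map, as in Fig. 4.8
print(max_pool2d(x, f=2, s=2))   # 2 x 2 map (non-overlapping pooling)
```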

4.2.4 NONLINEARITY
The weight layers in a CNN (e.g., convolutional and fully connected layers) are often followed by a nonlinear (or piece-wise linear) activation function. The activation function takes a real-valued input and squashes it into a small range such as [0, 1] or [−1, 1]. The application of a nonlinear function after the weight layers is highly important, since it allows a neural network to learn nonlinear mappings. In the absence of nonlinearities, a stacked network of weight layers is equivalent to a linear mapping from the input domain to the output domain.
A nonlinear function can also be understood as a switching or a selection mechanism, which decides whether a neuron will fire or not given all of its inputs. The activation functions that are commonly used in deep networks are differentiable to enable error back-propagation (Chapter 6). Below is a list of the most common activation functions used in deep neural networks.

Figure 4.9: Some of the common activation functions that are used in deep neural networks: (a) sigmoid, (b) tanh, (c) algebraic sigmoid, (d) ReLU, (e) noisy ReLU, (f) leaky ReLU/PReLU, (g) randomized leaky ReLU, and (h) exponential linear unit.
Sigmoid (Fig. 4.9a): The sigmoid activation function takes in a real number as its input, and
outputs a number in the range of [0,1]. It is defined as:
$$f_{\text{sigm}}(x) = \frac{1}{1 + e^{-x}}. \qquad (4.11)$$


Tanh (Fig. 4.9b): The tanh activation function implements the hyperbolic tangent function to squash the input values within the range of [−1, 1]. It is represented as follows:

$$f_{\tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. \qquad (4.12)$$

Algebraic Sigmoid Function (Fig. 4.9c): The algebraic sigmoid function also maps the input within the range [−1, 1]. It is given by:

$$f_{a\text{-}sig}(x) = \frac{x}{\sqrt{1 + x^{2}}}. \qquad (4.13)$$

Rectifier Linear Unit (Fig. 4.9d): The ReLU is a simple activation function which is of special practical importance because of its quick computation. A ReLU function maps the input to 0 if it is negative and keeps its value unchanged if it is positive. This can be represented as follows:

$$f_{\text{relu}}(x) = \max(0, x). \qquad (4.14)$$

The ReLU activation is motivated by the processing in the human visual cortex [Hahnloser et al., 2000]. The popularity and effectiveness of ReLU has led to a number of variants, which we introduce next. These variants address some of the shortcomings of the ReLU activation function, e.g., the leaky ReLU does not completely reduce negative inputs to zero.
Noisy ReLU (Fig. 4.9e): The noisy version of ReLU adds, for positive inputs, a sample drawn from a Gaussian distribution with mean zero and a variance that depends on the input value (σ(x)). It can be represented as follows:

$$f_{n\text{-}rel}(x) = \max(0, x + \epsilon), \qquad \epsilon \sim \mathcal{N}(0, \sigma(x)). \qquad (4.15)$$

Leaky ReLU (Fig. 4.9f): The rectifier function (Fig. 4.9d) completely switches off the output if the input is negative. A leaky ReLU function does not reduce the output to a zero value; rather, it outputs a down-scaled version of the negative input. This function is represented as:

$$f_{l\text{-}rel}(x) = \begin{cases} x & \text{if } x > 0 \\ cx & \text{if } x \le 0, \end{cases} \qquad (4.16)$$

where c is the leak factor, which is a constant typically set to a small value (e.g., 0.01).
Parametric Linear Units (Fig. 4.9f): The parametric ReLU function behaves in a similar manner to the leaky ReLU, with the only difference being that the tunable leak parameter is learned during network training. It can be expressed as follows:

$$f_{p\text{-}rel}(x) = \begin{cases} x & \text{if } x > 0 \\ ax & \text{if } x \le 0, \end{cases} \qquad (4.17)$$


where a is the leak factor which is automatically learned during the training.
Randomized Leaky Rectifier Linear Unit (Fig. 4.9g): The randomized leaky ReLU (RReLU) randomly selects the leak factor of the leaky ReLU function from a uniform distribution. Therefore,

$$f_{r\text{-}rel}(x) = \begin{cases} x & \text{if } x > 0 \\ ax & \text{if } x \le 0. \end{cases} \qquad (4.18)$$

The factor a is randomly chosen during training and set to a mean value during the test phase to get the contribution of all samples. Thus,

$$a \sim U(l, u) \quad \text{during training}, \qquad (4.19)$$
$$a = \frac{l + u}{2} \quad \text{during testing}. \qquad (4.20)$$

The upper and lower limits of the uniform distribution are usually set to 8 and 3, respectively.
Exponential Linear Units (Fig. 4.9h): The exponential linear units have both positive and negative values, and they therefore try to push the mean activations toward zero (similar to batch normalization). This helps in speeding up the training process while achieving a better performance.

$$f_{\text{elu}}(x) = \begin{cases} x & \text{if } x > 0 \\ a(e^{x} - 1) & \text{if } x \le 0. \end{cases} \qquad (4.21)$$

Here, a is a non-negative hyper-parameter which decides on the saturation level of the ELU in response to negative inputs.
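For reference, the following NumPy sketch implements several of the above activations (Eqs. (4.11)–(4.21)); the constants c = 0.01 and a = 1.0 are common defaults and are assumptions on our part rather than values prescribed by the text.

```python
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))                 # Eq. (4.11)
def tanh(x):               return np.tanh(x)                               # Eq. (4.12)
def algebraic_sigmoid(x):  return x / np.sqrt(1.0 + x**2)                  # Eq. (4.13)
def relu(x):               return np.maximum(0.0, x)                       # Eq. (4.14)
def leaky_relu(x, c=0.01): return np.where(x > 0, x, c * x)                # Eq. (4.16)
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1))  # Eq. (4.21)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (sigmoid, tanh, algebraic_sigmoid, relu, leaky_relu, elu):
    print(fn.__name__, fn(z))
```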

4.2.5 FULLY CONNECTED LAYERS
Fully connected layers essentially correspond to convolution layers with filters of size 1 × 1. Each unit in a fully connected layer is densely connected to all the units of the previous layer. In a typical CNN, fully connected layers are usually placed toward the end of the architecture. However, some successful architectures reported in the literature use this type of layer at an intermediate location within a CNN (e.g., NiN [Lin et al., 2013], which will be discussed in Section 6.3). The operation of a fully connected layer can be represented as a simple matrix multiplication followed by adding a vector of bias terms and applying an element-wise nonlinear function f(·):

$$y = f(W^{T} x + b), \qquad (4.22)$$

where x and y are the vectors of input and output activations, respectively, W denotes the matrix containing the weights of the connections between the layer units, and b represents the bias term vector. Note that a fully connected layer is identical to a weight layer that we studied in the case of the Multi-layer Perceptron in Section 3.4.2.
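A minimal sketch of Eq. (4.22) is given below: the fully connected layer reduces to a matrix multiplication, a bias addition, and an element-wise nonlinearity (ReLU is chosen here purely for illustration); the shapes are arbitrary.

```python
import numpy as np

def fully_connected(x, W, b, f=lambda z: np.maximum(0.0, z)):
    """y = f(W^T x + b), Eq. (4.22); ReLU is used here as the nonlinearity f."""
    return f(W.T @ x + b)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)          # input activations from the previous layer
W = rng.standard_normal((256, 10))    # one column of weights per output unit
b = np.zeros(10)
y = fully_connected(x, W, b)
print(y.shape)                        # (10,)
```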


4.2.6 TRANSPOSED CONVOLUTION LAYER
A normal convolution layer maps a spatially large-sized input to a relatively smaller-sized output. In several cases, e.g., image super-resolution, we want to go from a spatially low-resolution feature map to a larger output feature map with a higher resolution. This requirement is achieved by a transposed convolution layer, which is also called a “fractionally strided convolution layer” or the “up-sampling layer,” and sometimes (incorrectly) the deconvolution layer.¹
One can interpret the operation of a transposed convolution layer as the equivalent of a convolution layer, but traversed in the opposite direction, as in a backward pass during back-propagation. If the forward pass through the convolution layer gives a low-dimensional convolved output, the backward pass of the convolved output through the convolution layer should give back the original high spatial dimensional input. This backward transformation layer is called the “transposed convolution layer.” It can easily be understood in terms of standard convolution by revisiting the example of Fig. 4.3. In that example, a 2 × 2 filter was applied to a 4 × 4 input feature map with a stride of one and no zero-padding to generate a 3 × 3 output feature map. Note that this convolution operation can be represented as a matrix multiplication, which offers a highly efficient implementation in practice. For this purpose, we can represent the 2 × 2 kernel as an unrolled Toeplitz matrix as follows:
$$K = \begin{bmatrix}
k_{1,1} & k_{1,2} & 0 & 0 & k_{2,1} & k_{2,2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & k_{1,1} & k_{1,2} & 0 & 0 & k_{2,1} & k_{2,2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & k_{1,1} & k_{1,2} & 0 & 0 & k_{2,1} & k_{2,2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & k_{1,1} & k_{1,2} & 0 & 0 & k_{2,1} & k_{2,2} & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & k_{1,1} & k_{1,2} & 0 & 0 & k_{2,1} & k_{2,2} & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & k_{1,1} & k_{1,2} & 0 & 0 & k_{2,1} & k_{2,2} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & k_{1,1} & k_{1,2} & 0 & 0 & k_{2,1} & k_{2,2} & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & k_{1,1} & k_{1,2} & 0 & 0 & k_{2,1} & k_{2,2} & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & k_{1,1} & k_{1,2} & 0 & 0 & k_{2,1} & k_{2,2}
\end{bmatrix}$$

Here, k_{i,j} represents the filter element in the i-th row and j-th column. For an input feature map X, the convolution operation can be expressed in terms of the matrix multiplication between K and the vectorized form of the input, i.e., x = vec(X):

$$y = Kx, \qquad (4.23)$$

where y is the corresponding vectorized output. In the transposed convolution, we input a 3 × 3 feature map to generate an output feature map of size 4 × 4 as follows:

$$y = K^{T} x. \qquad (4.24)$$

Note that the x and y in the above two equations have different dimensions.
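The sketch below builds the unrolled matrix K for a 2 × 2 filter and a 4 × 4 input (stride 1, no padding) and applies Eqs. (4.23) and (4.24); it is a toy construction for illustration, and the filter values are arbitrary. Note that K^T y has the spatial size of the original input but does not recover its values, in line with the discussion of Fig. 4.10.

```python
import numpy as np

def conv_matrix(k, in_h, in_w):
    """Unroll a stride-1, no-padding 2D correlation with filter k into a Toeplitz-like matrix K."""
    f = k.shape[0]
    out_h, out_w = in_h - f + 1, in_w - f + 1
    K = np.zeros((out_h * out_w, in_h * in_w))
    for i in range(out_h):
        for j in range(out_w):
            row = i * out_w + j
            for a in range(f):
                for b in range(f):
                    K[row, (i + a) * in_w + (j + b)] = k[a, b]
    return K

k = np.array([[2., 0.], [-1., 3.]])            # illustrative 2 x 2 filter
x = np.arange(16, dtype=float)                 # vectorized 4 x 4 input, Eq. (4.23)
K = conv_matrix(k, 4, 4)
y = K @ x                                      # 9 values = the 3 x 3 convolution output
x_up = K.T @ y                                 # Eq. (4.24): back to 16 values (4 x 4), not equal to x
print(y.reshape(3, 3).shape, x_up.reshape(4, 4).shape)
```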
¹ Note that deconvolution in the signal processing literature refers to undoing the effect of the convolution operation with a filter F by applying its inverse filter F⁻¹. This deconvolution operation is usually performed in the Fourier domain. This is clearly different from the operation of a convolution transpose layer, which does not use an inverse filter.


Figure 4.10: Convolution transpose operation corresponding to the forward convolution operation with a unit stride and no zero-padding shown in Fig. 4.3. The input in this example has been zero-padded to obtain a 4 × 4 output. (a)–(i) show the computations performed at each step as the filter is slid over the input feature map to compute the corresponding value in the output feature map. The 2 × 2 filter (shown in green) is multiplied with the same-sized region (shown in orange) within a 5 × 5 input feature map (including zero-padding) and the resulting values are summed up to obtain a corresponding entry (shown in blue) in the output feature map at each convolution transpose step. Note that the convolution transpose operation does not invert the convolution operation (the input in Fig. 4.3 and the output here are different). However, it can be used to recover the loss in the spatial dimensions of the feature map (the input size in Fig. 4.3 and the output size here are the same). Furthermore, note that the filter values have been reversed for the convolution transpose operation compared to the filter used in Fig. 4.3.

The transposed convolution layer effectively up-samples the input feature map. This can also be understood as a convolution with an input feature map which has additional rows and columns of null values inserted around the actual values. The convolution with this up-sampled input then generates the desired result. This process is shown in Figs. 4.10 and 4.11 for the outputs we obtained in the examples of Figs. 4.3 and 4.5. It is important to note that the filter entries are reversed for the convolution transpose operation. Furthermore, note that the output size from the convolution transpose is equal to the size of the input of the corresponding convolution operation.


Figure 4.11: Convolution transpose operation corresponding to the forward convolution operation with a stride of 2 and unit zero-padding shown in Fig. 4.5. The input in this example has been zero-padded in between the feature map values to obtain a 4 × 4 output (shown in light blue). (a)–(i) show the computations performed at each step as the filter is slid over the input feature map to compute the corresponding value in the output feature map. The 2 × 2 filter (shown in green) is multiplied with the same-sized region (shown in orange) within a 5 × 5 input feature map (including zero-padding) and the resulting values are summed up to obtain a corresponding entry (shown in blue) in the output feature map at each convolution transpose step. Note that the convolution transpose operation does not invert the convolution operation (the input in Fig. 4.5 and the output here are different). However, it can be used to recover the loss in the spatial dimensions of the feature map (the input size in Fig. 4.5 and the output size here are the same). Furthermore, note that the filter values have been reversed for the convolution transpose operation compared to the filter used in Fig. 4.5.
However, the individual entries are different, because the convolution transpose does not invert the forward convolution. We can calculate the size of the output given a kernel of size f × f, stride s, and padding p:

$$h' = s(\hat{h} - 1) + f - 2p + (h - f + 2p) \bmod s, \qquad (4.25)$$
$$w' = s(\hat{w} - 1) + f - 2p + (w - f + 2p) \bmod s, \qquad (4.26)$$


where mod denotes the modulus operation, ĥ and ŵ denote the input dimensions without any zero-padding, and h and w denote the spatial dimensions of the input in the equivalent forward convolution (as shown in Figs. 4.3 and 4.5).
In the example shown in Fig. 4.10, p = 0, s = 1, f = 2, and ĥ = ŵ = 3. Therefore, the spatial dimensions of the output feature map are h' = w' = 4. In Fig. 4.11, s − 1 zero values are added between each pair of input elements to extend the input and produce a spatially larger output. The parameter values are p = 1, s = 2, f = 2, and ĥ = ŵ = 3, and the resulting output dimensions are again h' = w' = 4.
Finally, it is important to note that, from an implementation point of view, the transposed
convolution operation is much faster when implemented as a matrix multiplication operation
compared to the zero padding of the input feature map at intermediate locations followed by
normal convolutions [Dumoulin and Visin, 2016].
Figure 4.12: The function of an ROI pooling layer within an object detection framework is illustrated in this figure. Note that a single feature channel, only a few ROI proposals (just three), and a relatively small output size (2 × 2) from the ROI pooling layer have been shown here for the sake of clarity.

4.2.7 REGION OF INTEREST POOLING
The Region of Interest (RoI) Pooling layer is an important component of convolutional neural networks which is mostly used for object detection [Girshick, 2015] (and slightly modified
versions for related tasks, e.g., instance segmentation [He et al., 2017]). In the object detection
problem, the goal is to precisely locate each object in an image using a bounding box and tag


it with the relevant object category. These objects can be located at any region in an image and generally vary greatly in their size, shape, appearance, and texture properties. The usual course of action with such problems is to first generate a large set of candidate object proposals, with the aim of including all possible object bounding boxes that may be discovered in an image. For these initial proposals, off-the-shelf detectors such as Selective Search [Uijlings et al., 2013] or EdgeBox [Zitnick and Dollár, 2014] are usually used. Recent works have also proposed approaches to integrate the proposal generation step into a CNN (e.g., the Region Proposal Network [Ren et al., 2015]). Since the valid detections are very few compared to the generated proposals (usually < 1%), the resources used to process all the negative detections are wasted.
The ROI pooling layer provides a solution to this problem by shifting the processing that is specific to individual bounding boxes to a later stage in the network architecture. An input image is processed through the deep network and intermediate CNN feature maps (with reduced spatial dimensions compared to the input image) are obtained. The ROI pooling layer takes the feature map of the complete image and the coordinates of each ROI as its input. The ROI coordinates can be used to roughly locate the features corresponding to a specific object. However, the features thus obtained have different spatial sizes, because each ROI can have different dimensions. Since the subsequent CNN layers can only operate on fixed-dimensional inputs, an ROI pooling layer converts these variable-sized feature maps (corresponding to different object proposals) to a fixed-sized output feature map for each object proposal, e.g., a 5 × 5 or a 7 × 7 map. The fixed output size is a hyper-parameter which is set before the training process. Specifically, this same-sized output is achieved by dividing each ROI into a set of cells with equal dimensions. The number of these cells is the same as the required output dimensions. Afterward, the maximum value in each cell is calculated (max-pooling) and assigned to the corresponding output feature map location.
By using a single set of input feature maps to generate a feature representation for each region proposal, the ROI pooling layer greatly improves the efficiency of a deep network: the CNN only needs a single pass to compute the features corresponding to all the ROIs. It also makes it possible to train the network in an end-to-end manner, as a unified system. Note that the ROI pooling layer is usually plugged into the latter portion of a deep architecture, to avoid the large amount of computation that would result if the region-based processing were performed in the early layers of the network (due to the large number of proposals). An example use case of the ROI pooling layer in a CNN is described in Section 7.2.1.
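A simplified sketch of the ROI max-pooling operation is shown below; it assumes a single-channel feature map and ROI coordinates already expressed in feature-map pixels (in practice, the image-space ROIs are first rescaled to the feature-map resolution), and the helper name is ours.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=2):
    """Max-pool one ROI (x0, y0, x1, y1), in feature-map coordinates, to a fixed grid."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    # Split the region into output_size x output_size cells of (almost) equal size.
    row_edges = np.linspace(0, h, output_size + 1, dtype=int)
    col_edges = np.linspace(0, w, output_size + 1, dtype=int)
    out = np.zeros((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            cell = region[row_edges[i]:row_edges[i+1], col_edges[j]:col_edges[j+1]]
            out[i, j] = cell.max()
    return out

fmap = np.random.rand(16, 16)                  # toy single-channel CNN feature map
rois = [(0, 0, 8, 6), (4, 4, 13, 16)]          # illustrative proposals of different sizes
pooled = [roi_pool(fmap, r, output_size=2) for r in rois]
print([p.shape for p in pooled])               # both are (2, 2), regardless of ROI size
```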

4.2.8 SPATIAL PYRAMID POOLING LAYER
The spatial pyramid pooling (SPP) layer [He et al., 2015b] in CNNs is inspired by the pyramid-based approaches proposed for the Bag of Visual Words (BoW) style feature encoding methods [Lazebnik et al., 2006]. The main intuition behind the SPP layer is that interesting discriminative features can appear at a variety of scales in the convolutional feature maps. Therefore, it is useful to incorporate this information for classification purposes.


To efficiently encode this information in a single descriptor, the SPP layer divides the feature maps into three levels of spatial blocks. At the global level, the features corresponding to all the spatial locations are pooled together to obtain just a single vector (with a dimension equal to the number of channels of the previous layer, say n). At the middle level, the feature maps are divided into four (2 × 2) disjoint spatial blocks of equal dimensions and the features within each block are pooled together to obtain a single feature vector for each of the four blocks. This n-dimensional representation is concatenated across the blocks, resulting in a 4n-dimensional feature vector for the middle level. Finally, at the local level, the feature maps are divided into 16 blocks (4 × 4). The features within each spatial block are pooled together to give an n-dimensional feature vector. All 16 feature vectors are concatenated to form a single 16n-dimensional feature representation. Finally, the local, mid-level, and global feature representations are concatenated together to generate a (16n + 4n + n = 21n)-dimensional feature representation, which is forwarded to the classifier layers (or fully connected layers) for classification (see Fig. 4.13).
Figure 4.13: The Spatial Pyramid Pooling Layer [He et al., 2014] incorporates discriminative
information at three scales which is useful for accurate classification. (Figure used with permission.)
Therefore, an SPP layer makes use of localized pooling and concatenation operations to generate a high-dimensional feature vector as its output. The combination of information at multiple scales helps in achieving robustness against variations in object pose, scale, and shape (deformations). Since the SPP layer output does not depend on the length and width of the feature maps, it allows the CNN to handle input images of any size. Furthermore, it can perform a similar operation on individual object regions for the detection task. This saves a reasonable amount of time compared to the case where we input individual object proposals and obtain a feature representation for each of them.
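The following sketch mimics the three-level pooling described above for a channel-last feature map of arbitrary spatial size, producing a fixed (16 + 4 + 1)·n dimensional descriptor; max pooling within each block is an assumption on our part (the text does not fix the pooling function), and the function name is ours.

```python
import numpy as np

def spatial_pyramid_pool(features, levels=(1, 2, 4)):
    """Max-pool an (h, w, n) feature map over 1x1, 2x2 and 4x4 grids and concatenate."""
    h, w, n = features.shape
    pooled = []
    for g in levels:                                    # g x g spatial blocks at this level
        row_edges = np.linspace(0, h, g + 1, dtype=int)
        col_edges = np.linspace(0, w, g + 1, dtype=int)
        for i in range(g):
            for j in range(g):
                block = features[row_edges[i]:row_edges[i+1],
                                 col_edges[j]:col_edges[j+1], :]
                pooled.append(block.max(axis=(0, 1)))  # one n-dim vector per block
    return np.concatenate(pooled)                       # (1 + 4 + 16) * n values

conv5 = np.random.rand(13, 17, 256)                     # arbitrary spatial size, 256 channels
desc = spatial_pyramid_pool(conv5)
print(desc.shape)                                       # (5376,) = 21 * 256
```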

4.2.9 VECTOR OF LOCALLY AGGREGATED DESCRIPTORS LAYER
As we saw for the case of the SPP layer, the Vector of Locally Aggregated Descriptors (VLAD) layer in CNNs [Arandjelovic et al., 2016] also gets its inspiration from the VLAD pooling approach used in the BoW style models to aggregate local features [Jégou et al., 2010]. The main idea behind the VLAD layer can be explained as follows. Given a set of local descriptors {x_i ∈ R^D}_{i=1}^{N}, we aim to represent these local features in terms of a set of visual words {c_i ∈ R^D}_{i=1}^{K} (also called “key-points” or “cluster centers”). This is achieved by finding the association of each local descriptor with all the cluster centers. The (soft) association is measured as a weighted difference between each descriptor and all the K cluster centers. This results in a K × D dimensional feature matrix F given by:
$$F(j, k) = \sum_{i=1}^{N} a_k(x_i)\,\big(x_i(j) - c_k(j)\big). \qquad (4.27)$$

Here, the association term a_k measures the connection between the i-th local feature (x_i) and the k-th cluster center (c_k); e.g., it will be 0 if x_i is furthest away from c_k and 1 if x_i is closest to c_k. The association term is defined as follows:

$$a_k(x_i) = \frac{\exp(w_k^{T} x_i + b_k)}{\sum_{r} \exp(w_r^{T} x_i + b_r)}, \qquad (4.28)$$

where w, b are the weights and biases of a fully connected layer. From an implementation point of view, the computation of the association term can be understood as passing the descriptors through a fully connected CNN layer, followed by a soft-max operation. The parameters w, b, and c are learned during the training process. Note that since all the operations are differentiable, end-to-end training is feasible. For classification purposes, the feature matrix F is first normalized column-wise using the ℓ2 norm of each column, then converted to a vector and again ℓ2 normalized. Figure 4.14 summarizes the operation of a VLAD layer.
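The sketch below follows Eqs. (4.27) and (4.28): soft assignments obtained from a fully connected layer followed by a soft-max, weighted residuals to the cluster centers, and the two ℓ2 normalization steps. The parameters are randomly initialized purely for illustration, whereas in the actual layer w, b, and c are learned.

```python
import numpy as np

def vlad_layer(X, C, W, b):
    """X: (N, D) local descriptors, C: (K, D) cluster centers, W: (D, K), b: (K,)."""
    logits = X @ W + b                                   # (N, K), terms inside Eq. (4.28)
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability for soft-max
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)                    # soft assignments a_k(x_i)
    # Eq. (4.27): F[j, k] = sum_i a_k(x_i) * (x_i(j) - c_k(j))
    F = np.einsum('ik,ikd->dk', A, X[:, None, :] - C[None, :, :])

    F /= (np.linalg.norm(F, axis=0, keepdims=True) + 1e-12)   # column-wise l2 normalization
    v = F.flatten()
    return v / (np.linalg.norm(v) + 1e-12)                    # final l2 normalization

rng = np.random.default_rng(0)
N, D, K = 100, 128, 8                                    # 100 local descriptors, 8 visual words
X, C = rng.standard_normal((N, D)), rng.standard_normal((K, D))
W, b = rng.standard_normal((D, K)), np.zeros(K)
print(vlad_layer(X, C, W, b).shape)                      # (1024,) = K * D
```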

4.2.10 SPATIAL TRANSFORMER LAYER
As you may have noticed for the case of the VLAD layer, the introduced layer does not consist of a single operation. Instead, it involves a set of inter-connected sub-modules, each of which is implemented as an individual layer. The spatial transformer layer [Jaderberg et al., 2015] is another such example, which comprises three main modules, namely (a) a localization network, (b) a grid generator, and (c) a sampler. Figure 4.15 illustrates the three individual blocks and their function.
In a nutshell, the spatial transformer layer learns to focus on the interesting part of its input. This layer applies geometric transformations on the interesting parts of the input to focus attention and perform rectification.


Figure 4.14: The figure shows the working of the VLAD layer [Arandjelovic et al., 2016]. It takes multiple local features from the CNN layers and aggregates them to generate a high-dimensional output feature representation.
It can be plugged in after the input layer or after any of the earlier convolutional layers which generate a relatively large-sized (height × width) output feature map.
The first module, called the localization network, takes the input feature maps (or the original input image) and predicts the parameters of the transformation which needs to be applied. This network can be implemented as any combination of convolutional and/or fully connected layers. However, the final layer is a regression layer which generates the parameter vector θ. The dimension of the output parameter vector θ depends on the kind of transformation; e.g., for an affine transformation, it has six parameters defined as follows:

$$\theta = \begin{bmatrix} \theta_1 & \theta_2 & \theta_3 \\ \theta_4 & \theta_5 & \theta_6 \end{bmatrix}. \qquad (4.29)$$
The grid generator generates a grid of coordinates in the input image corresponding to each pixel
from the output image. This mapping is useful for the next step, where a sampling function is
used to transform the input image. The sampler generates the output image using the grid given
by the grid generator. This is achieved by using a sampling kernel which is applied at each pixel


of the input image (or at each value in the input feature map). To enable end-to-end training, the
sampling kernel should be differentiable with respect to the input and the grid co-ordinates (x
and y coordinates). Examples of kernels include the nearest neighbor copying from the source
(input) to the target (output) and the bilinear sampling kernels.
Since all the modules are fully differentiable, the spatial transformation can be learned
end-to-end using a standard back-propagation algorithm. This provides a big advantage since it
allows the network to automatically shift focus toward the more discriminative portions of the
input image or the feature maps.

Figure 4.15: The spatial transformer layer with its three modules. The localization network predicts the transformation parameters. The grid generator identifies the points in the input domain
on which a sampling kernel is applied using the sampler.

4.3 CNN LOSS FUNCTIONS

Having studied a wide variety of simple and relatively more complex CNN layers, we now discuss the final layer in a CNN, which is used only during the training process. This layer uses a “loss function,” also called the “objective function,” to estimate the quality of the predictions made by the network on the training data, for which the actual labels are known. These loss functions are optimized during the learning process of a CNN, whose details will be covered in Chapter 5.
A loss function quantifies the difference between the estimated output of the model (the prediction) and the correct output (the ground truth).
The type of loss function used in a CNN model depends on the end problem. The generic set of problems for which neural networks are usually used (and the associated loss functions) can be categorized as follows.
1. Binary Classification (SVM hinge loss, Squared hinge loss).
2. Identity Verification (Contrastive loss).
3. Multi-class Classification (Softmax loss, Expectation loss).
4. Regression (SSIM, `1 error, Euclidean loss).
Note that the loss functions which are suitable for multi-class classification tasks are also applicable to binary classification tasks. However, the reverse is generally not true, unless a


multi-class problem is divided into multiple one-vs.-rest binary classification problems where a
separate classifier is trained for each case using a binary classification loss. Below, we discuss the
above mentioned loss functions in more detail.

4.3.1 CROSS-ENTROPY LOSS
The cross-entropy loss (also termed “log loss” and “soft-max loss”) is defined as follows:
$$L(p, y) = -\sum_{n} y_n \log(p_n), \qquad n \in [1, N], \qquad (4.30)$$

where y denotes the desired output and p is the vector of probabilities for each output category. There is a total of N neurons in the output layer, therefore p, y ∈ R^N. The probability of each class can be calculated using the soft-max function: p_n = \frac{\exp(\hat{p}_n)}{\sum_k \exp(\hat{p}_k)}, where \hat{p}_n is the unnormalized output score from the previous layer in the network. Due to the form of the normalizing function in the loss, this loss is also called the soft-max loss.
It is interesting to note that optimizing the network parameters using the cross-entropy loss is equivalent to minimizing the KL-divergence between the predicted output (the generated distribution p) and the desired output (the true distribution y). The KL-divergence between p and y can be expressed as the difference between the cross-entropy (denoted by L(·)) and the entropy (denoted by H(·)) as follows:
$$KL(p \,\|\, y) = L(p, y) - H(p). \qquad (4.31)$$

Since entropy is just a constant value, minimizing cross entropy is equivalent to minimizing the
KL-divergence between the two distributions.
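As an illustration of Eq. (4.30), the sketch below combines the soft-max and the cross-entropy for a single sample with a one-hot label; the max-subtraction is a standard numerical-stability trick and the scores are illustrative.

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    """scores: unnormalized outputs p_hat (N,), y: one-hot desired output (N,). Eq. (4.30)."""
    shifted = scores - scores.max()              # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -np.sum(y * log_probs)                # L(p, y) = -sum_n y_n log(p_n)

scores = np.array([2.0, 1.0, 0.1])               # illustrative class scores
y = np.array([1.0, 0.0, 0.0])                    # true class is the first one
print(softmax_cross_entropy(scores, y))          # ~0.417
```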

4.3.2 SVM HINGE LOSS
The SVM hinge loss is motivated by the error function which is normally used during the training of an SVM classifier. The hinge loss maximizes the margin between the true class and the negative class samples. This loss is defined as follows:
$$L(p, y) = \sum_{n} \max\big(0,\; m - (2y_n - 1)\,p_n\big), \qquad (4.32)$$

where m is the margin, which is usually set equal to a constant value of 1, and p, y denote the predicted and desired outputs, respectively. An alternative formulation of the hinge loss is Crammer and Singer's [Crammer and Singer, 2001] loss function, given below:

$$L(p, y) = \max\big(0,\; m + \max_{i \neq c} p_i - p_c\big), \qquad c = \operatorname{argmax}_{j} y_j, \qquad (4.33)$$


where p_c denotes the prediction at the correct class index c. Another similar formulation of the hinge loss was proposed by Weston and Watkins [Weston et al., 1999]:

$$L(p, y) = \sum_{i \neq c} \max(0,\; m + p_i - p_c), \qquad c = \operatorname{argmax}_{j} y_j. \qquad (4.34)$$
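The sketch below implements the three hinge-loss formulations of Eqs. (4.32)–(4.34) for a single sample; p holds raw class scores, y is a one-hot label, and m = 1 as suggested above.

```python
import numpy as np

def hinge_binary(p, y, m=1.0):
    """Eq. (4.32): sum_n max(0, m - (2*y_n - 1) * p_n) with y_n in {0, 1}."""
    return np.sum(np.maximum(0.0, m - (2.0 * y - 1.0) * p))

def hinge_crammer_singer(p, y, m=1.0):
    """Eq. (4.33): max(0, m + max_{i != c} p_i - p_c), c = argmax_j y_j."""
    c = int(np.argmax(y))
    others = np.delete(p, c)
    return max(0.0, m + others.max() - p[c])

def hinge_weston_watkins(p, y, m=1.0):
    """Eq. (4.34): sum_{i != c} max(0, m + p_i - p_c)."""
    c = int(np.argmax(y))
    others = np.delete(p, c)
    return np.sum(np.maximum(0.0, m + others - p[c]))

p = np.array([2.0, 0.5, -1.0])       # illustrative class scores
y = np.array([1.0, 0.0, 0.0])        # correct class: index 0
print(hinge_crammer_singer(p, y))    # 0.0 (margin satisfied: 2.0 - 0.5 >= 1)
print(hinge_weston_watkins(p, y))    # 0.0
```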

4.3.3 SQUARED HINGE LOSS
The squared hinge loss function has been shown to perform slightly better than the vanilla hinge loss function in some applications [Tang, 2013]. This loss function simply squares the max term in Eqs. (4.32)–(4.34). The squared hinge loss is more sensitive to margin violations compared to the vanilla hinge loss.
4.3.4 EUCLIDEAN LOSS
The Euclidean loss (also termed the “quadratic loss,” “mean square error,” or “ℓ2 error”) is defined in terms of the squared error between the predictions (p ∈ R^N) and the ground-truth labels (y ∈ R^N):

L(p, y) = \frac{1}{2N} \sum_{n} (p_n - y_n)^2, \qquad n ∈ [1, N].    (4.35)

4.3.5 THE ℓ1 ERROR
The ℓ1 loss can be used for regression problems and has been shown to outperform the Euclidean loss in some cases [Zhao et al., 2015]. It is defined as follows:

L(p, y) = \frac{1}{N} \sum_{n} |p_n - y_n|, \qquad n ∈ [1, N].    (4.36)
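Both regression losses above reduce to one-liners in NumPy; a minimal sketch (with illustrative names and values) is:

import numpy as np

def euclidean_loss(p, y):
    # Euclidean (mean square error) loss, Eq. (4.35).
    return 0.5 * np.mean((p - y) ** 2)

def l1_loss(p, y):
    # The l1 error, Eq. (4.36).
    return np.mean(np.abs(p - y))

p = np.array([0.9, 2.1, -0.3])   # predictions
y = np.array([1.0, 2.0, 0.0])    # ground truth
print(euclidean_loss(p, y), l1_loss(p, y))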

4.3.6 CONTRASTIVE LOSS
The contrastive loss is used to map similar inputs to nearby points in the feature/output space and dissimilar inputs to distant points. This loss function operates on pairs of either similar or dissimilar inputs (e.g., in siamese networks [Chopra et al., 2005]). It can be represented as follows:

L(p, y) = \frac{1}{2N} \sum_{n} \big[ y\, d^2 + (1 - y) \max(0,\; m - d)^2 \big], \qquad n ∈ [1, N],    (4.37)

where m is the margin and y ∈ {0, 1} indicates whether the input pair is dissimilar (y = 0) or similar (y = 1). Here, d can be any valid distance measure, such as the Euclidean distance:

d = \| f_a - f_b \|_2,    (4.38)

where f_a and f_b are the learned representations of the two inputs in the feature space, and \|·\|_2 denotes the ℓ2 (or Euclidean) norm.
There are other variants of verification losses which extend to triplets, e.g., the triplet loss used in triplet networks [Schroff et al., 2015], as opposed to siamese networks.
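A minimal NumPy sketch of Eqs. (4.37) and (4.38) over a batch of embedding pairs is given below (the function and variable names are our own):

import numpy as np

def contrastive_loss(fa, fb, y, margin=1.0):
    # fa, fb: (num_pairs, feat_dim) embeddings of the two inputs of each pair.
    # y: (num_pairs,) with 1 for similar pairs and 0 for dissimilar pairs.
    d = np.linalg.norm(fa - fb, axis=1)                            # Eq. (4.38)
    similar_term = y * d ** 2                                      # pulls similar pairs together
    dissimilar_term = (1 - y) * np.maximum(0.0, margin - d) ** 2   # pushes dissimilar pairs apart
    return np.mean(similar_term + dissimilar_term) / 2.0           # Eq. (4.37)

fa = np.array([[0.1, 0.9], [1.0, 0.0]])
fb = np.array([[0.0, 1.0], [0.9, 0.1]])
y = np.array([1.0, 0.0])
print(contrastive_loss(fa, fb, y))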

4.3.7 EXPECTATION LOSS
The expectation loss is defined as follows:

L(p, y) = \sum_{n} \left| y_n - \frac{\exp(p_n)}{\sum_k \exp(p_k)} \right|, \qquad n ∈ [1, N].    (4.39)

It minimizes the expected misclassification probability, which is why it is called the expectation loss. Note that the cross-entropy loss also uses the soft-max function, which makes it similar to the expectation loss; however, the cross-entropy loss directly maximizes the probability of fully correct predictions [Janocha and Czarnecki, 2017].
The expectation loss provides more robustness against outliers because its objective maximizes the expectation of correct predictions. However, this loss function is seldom used in deep neural networks because it is neither convex nor concave with respect to the weights of the preceding layer, which leads to optimization issues (such as instability and slow convergence) during the learning process.

4.3.8 STRUCTURAL SIMILARITY MEASURE
For image processing problems, perceptually grounded loss functions have been used in combination with CNNs [Zhao et al., 2015]. One example of such losses is the Structural Similarity (SSIM) measure. It is defined as follows:

L(p, y) = 1 - SSIM(n),    (4.40)

where n is the center pixel of the image, p, y denote the predicted output and the desired output, respectively, and the structural similarity at that pixel is given by:

SSIM(n) = \frac{2 μ_{p_n} μ_{y_n} + C_1}{μ_{p_n}^2 + μ_{y_n}^2 + C_1} \cdot \frac{2 σ_{p_n y_n} + C_2}{σ_{p_n}^2 + σ_{y_n}^2 + C_2}.    (4.41)

Here, the means, standard deviations, and covariance are represented by μ, σ, and σ_{p_n y_n}, respectively. At each pixel location n, the mean and standard deviation are computed using a Gaussian filter centered at the pixel, with a standard deviation σ_G. C_1 and C_2 denote image-dependent constants which provide stabilization against small denominators. Note that the calculation of SSIM at one pixel requires the neighboring pixel values within the support of the Gaussian filter. Also note that we do not calculate the SSIM measure at every pixel, because it cannot be calculated straightforwardly for pixels that are close to the image boundaries.
In the following chapter, we will introduce different techniques for weight initialization
and gradient-based parameter learning algorithms for deep networks.

CHAPTER 5
CNN Learning
In Chapter 4, we discussed the different architectural blocks of a CNN and their operational details. Most of these CNN layers involve parameters which need to be tuned appropriately for a given computer vision task (e.g., image classification or object detection). In this chapter, we discuss the various mechanisms and techniques that are used to set the weights in deep neural networks. We first cover concepts such as weight initialization and network regularization in Sections 5.1 and 5.2, respectively, which help in the successful optimization of CNNs. Afterward, we introduce gradient-based parameter learning for CNNs in Section 5.3, which is quite similar to the MLP parameter learning process discussed in Chapter 3. The details of neural network optimization algorithms (also called “solvers”) are presented in Section 5.4. Finally, in Section 5.5, we explain the various types of approaches that are used for the calculation of the gradient during the error back-propagation process.

5.1 WEIGHT INITIALIZATION

A correct weight initialization is the key to stably train very deep networks. An ill-suited initialization can lead to the vanishing or exploding gradient problem during error back-propagation.
In this section, we introduce several approaches to perform weight initialization and provide
comparisons between them to illustrate their benefits and problems. Note that the discussion
below pertains to the initialization of neuron weights within a network and the biases are usually
set to zero at the start of the network training. If all the weights are also set to zero at the start
of training, the weight updates will be identical (due to symmetric outputs) and the network
will not learn anything useful. To break this symmetry between neural units, the weights are
initialized randomly at the start of the training. In the following, we describe several popular
approaches to network initialization.

5.1.1 GAUSSIAN RANDOM INITIALIZATION
A common approach to weight initialization in CNNs is the Gaussian random initialization
technique. This approach initializes the convolutional and the fully connected layers using random matrices whose elements are sampled from a Gaussian distribution with zero mean and a
small standard deviation (e.g., 0.1 and 0.01).

5.1.2 UNIFORM RANDOM INITIALIZATION
The uniform random initialization approach initializes the convolutional and the fully connected
layers using random matrices whose elements are sampled from a uniform distribution (instead
of a normal distribution as in the earlier case) with a zero mean and a small standard deviation
(e.g., 0.1 and 0.01). The uniform and normal random initializations generally perform identically. However, the training of very deep networks becomes a problem with a random initialization of weights from a uniform or normal distribution [Simonyan and Zisserman, 2014b]. The
reason is that the forward and backward propagated activations can either diminish or explode
when the network is very deep (see Section 3.2.2).
5.1.3 ORTHOGONAL RANDOM INITIALIZATION
Orthogonal random initialization has also been shown to perform well in deep neural networks [Saxe et al., 2013]. Note that a Gaussian random initialization is only approximately orthogonal. For the orthogonal random initialization, a random weight matrix is decomposed, e.g., using SVD, and the resulting orthogonal matrix (U) is then used for the weight initialization of the CNN layers.
5.1.4 UNSUPERVISED PRE-TRAINING
One approach to avoid the gradient diminishing or exploding problem is to use layer-wise pretraining in an unsupervised fashion. However, this type of pre-training has found more success
in the training of deep generative networks, e.g., Deep Belief Networks [Hinton et al., 2006]
and Auto-encoders [Bengio et al., 2007]. The unsupervised pre-training can be followed by
a supervised fine-tuning stage to make use of any available annotations. However, due to the new hyper-parameters, the considerable amount of effort involved in such an approach, and the availability of better initialization techniques, layer-wise pre-training is now seldom used to enable the training of very deep CNNs. We describe some of the more successful approaches to initialize deep CNNs next.
5.1.5 XAVIER INITIALIZATION
A random initialization of a neuron makes the variance of its output directly proportional to the number of its incoming connections (the neuron's fan-in measure). To alleviate this problem, Glorot and Bengio [2010] proposed to randomly initialize the weights with a variance that depends on the number of incoming and outgoing connections (n_{f_in} and n_{f_out}, respectively) of a neuron:

Var(w) = \frac{2}{n_{f_in} + n_{f_out}},    (5.1)

where w are the network weights. Note that the fan-out measure is used in the variance above to balance the back-propagated signal as well. Xavier initialization works quite well in practice and

leads to better convergence rates. However, a number of simplistic assumptions are involved in the above initialization, the most prominent being that a linear relationship between the input and output of a neuron is assumed. In practice, all neurons contain a nonlinearity, which makes Xavier initialization statistically less accurate.

5.1.6 RELU AWARE SCALED INITIALIZATION
He et al. [2015a] suggested an improved version of the scaled (or Xavier) initialization, noting that neurons with a ReLU nonlinearity do not follow the assumptions made for the Xavier initialization. Specifically, since the ReLU activation sets nearly half of its inputs to zero, the variance of the distribution from which the initial weights are randomly sampled should be

Var(w) = \frac{2}{n_{f_in}}.    (5.2)

The ReLU aware scaled initialization works better compared to Xavier initialization for recent
architectures which are based on the ReLU nonlinearity.
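As a concrete illustration of Eqs. (5.1) and (5.2), the following NumPy sketch draws Gaussian weights with the Xavier and the ReLU aware variances for a fully connected layer (the helper names and layer sizes are our own choices):

import numpy as np

def xavier_init(fan_in, fan_out):
    # Xavier (Glorot) initialization, Eq. (5.1): Var(w) = 2 / (fan_in + fan_out).
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_out, fan_in) * std

def he_init(fan_in, fan_out):
    # ReLU aware scaled (He) initialization, Eq. (5.2): Var(w) = 2 / fan_in.
    std = np.sqrt(2.0 / fan_in)
    return np.random.randn(fan_out, fan_in) * std

W = he_init(4096, 1000)     # e.g., an FC layer with 4,096 inputs and 1,000 outputs
print(W.std())              # close to sqrt(2 / 4096)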

5.1.7 LAYER-SEQUENTIAL UNIT VARIANCE
The layer-sequential unit variance (LSUV) initialization is a simple extension of the orthonormal
weight initialization in deep network layers [Mishkin and Matas, 2015]. It combines the benefits
of batch-normalization and the orthonormal weight initialization to achieve an efficient training
for very deep networks. It proceeds in two steps, described below.
• Orthogonal initialization—In the first step, all the weight layers (convolutional and fully
connected) are initialized with orthogonal matrices.
• Variance normalization—In the second step, the method proceeds sequentially from the first to the final layer and normalizes the variance of each layer's output to one (unit variance). This is similar to the batch normalization layer, which normalizes the output activations for each batch to be zero-centered with a unit variance. However, unlike batch normalization, which is applied during the training of the network, LSUV is applied while initializing the network and therefore saves the overhead of normalization for each batch during the training iterations.

5.1.8 SUPERVISED PRE-TRAINING
In practical scenarios, it is desirable to train very deep networks, but we do not have a large
amount of annotated data available for many problem settings. A very successful practice in
such cases is to first train the neural network on a related but different problem, where a large
amount of training data is already available. Afterward, the learned model can be “adapted” to
the new task by initializing with weights pre-trained on the larger dataset. This process is called

72

5. CNN LEARNING

“fine-tuning” and is a simple, yet effective, way to transfer learning from one task to another
(sometimes interchangeably referred to as domain transfer or domain adaptation). As an example, in order to perform scene classification on a relatively small dataset, MIT-67, the network
can be initialized with the weights learned for object classification on a much larger dataset such
as ImageNet [Khan et al., 2016b].
Transfer Learning is an approach to adapt and apply the knowledge acquired on another related task to the task at hand. Depending on our CNN
architecture, this approach can take two forms.
• Using a Pre-trained Model: If one wants to use an off-the-shelf CNN
architecture (e.g., AlexNet, GoogleNet, ResNet, DenseNet) for a given
task, an ideal choice is to adopt the available pre-trained models that are
learned on huge datasets such as ImageNet (with 1.2 million images)a
and Places205 (with 2.5 million images).b
The pre-trained model can then be tailored for the given task, e.g., by changing the dimensions of the output neurons (to cater for a different number of classes), modifying the loss function, and learning the final few layers from scratch (normally, learning the final 2–3 layers suffices for most cases); a minimal sketch of this recipe is given after this box. If the dataset available for the end task is sufficiently large, the complete model can also be fine-tuned on the new dataset. For this purpose, small learning rates are used for the initial pre-trained CNN layers, so that the learning previously acquired on the large-scale dataset (e.g., ImageNet) is not completely lost. This is essential, since it has been shown that the features learned over large-scale datasets are generic in nature and can be used for new tasks in computer vision [Azizpour et al., 2016, Sharif Razavian et al., 2014].
• Using a Custom Architecture: If one opts for a customized CNN architecture, transfer learning can still be helpful if the target dataset is constrained in terms of size and diversity. To this end, one can first train
the custom architecture on a large scale annotated dataset and then use
the resulting model in the same manner as described in the bullet point
above.
Alongside the simple fine-tuning approach, more involved transfer learning approaches have also been proposed in the recent literature. For example, Anderson et al. [2016] learn the way pre-trained model parameters are shifted on new datasets. The learned transformation is then applied to the network parameters, and the resulting activations, alongside the pre-trained (non-tunable) network activations, are used in the final model.


a Popular deep learning libraries host a wide variety of pre-trained CNN models, e.g.,
Tensorflow (https://github.com/tensorflow/models),
Torch (https://github.com/torch/torch7/wiki/ModelZoo),
Keras (https://github.com/albertomontesg/keras-model-zoo),
Caffe (https://github.com/BVLC/caffe/wiki/Model-Zoo),
MatConvNet (http://www.vlfeat.org/matconvnet/pretrained/).
b http://places.csail.mit.edu/downloadCNN.html
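As indicated in the box above, the fine-tuning recipe can be written in a few lines with a high-level library. The following sketch assumes the Keras API and its bundled ImageNet pre-trained ResNet50 model; the number of classes (67, as in MIT-67), the learning rates, and the commented-out training calls are illustrative choices only, not a prescription:

# A minimal fine-tuning sketch (assumptions: Keras with pre-trained ResNet50 weights).
from keras.applications import ResNet50
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
from keras.optimizers import SGD

base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Replace the 1,000-way ImageNet classifier with a new head for the target task.
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(67, activation='softmax')(x)
model = Model(inputs=base.input, outputs=outputs)

# First freeze the pre-trained layers and learn only the new head ...
for layer in base.layers:
    layer.trainable = False
model.compile(optimizer=SGD(lr=1e-2, momentum=0.9), loss='categorical_crossentropy')
# model.fit(train_images, train_labels, epochs=5)

# ... then optionally unfreeze the base and fine-tune with a small learning rate,
# so that the knowledge acquired on ImageNet is not completely lost.
for layer in base.layers:
    layer.trainable = True
model.compile(optimizer=SGD(lr=1e-4, momentum=0.9), loss='categorical_crossentropy')
# model.fit(train_images, train_labels, epochs=5)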

5.2 REGULARIZATION OF CNN

Since deep neural networks have a large number of parameters, they tend to over-fit on the
training data during the learning process. By over-fitting, we mean that the model performs
really well on the training data but it fails to generalize well to unseen data. It, therefore, results
in an inferior performance on new data (usually the test set). Regularization approaches aim
to avoid this problem using several intuitive ideas which we discuss below. We can categorize
common regularization approaches into the following classes, based on their central idea:
• approaches which regularize the network using data level techniques
(e.g., data augmentation);
• approaches which introduce stochastic behavior in the neural activations
(e.g., dropout and drop connect);
• approaches which normalize batch statistics in the feature activations
(e.g., batch normalization);
• approaches which use decision level fusion to avoid over-fitting
(e.g., ensemble model averaging);
• approaches which introduce constraints on the network weights
(e.g., `1 norm, `2 norm, max-norm, and elastic net constraints); and
• approaches which use guidance from a validation set to halt the learning process
(e.g., early stopping).
Next, we discuss the above-mentioned approaches in detail.

5.2.1 DATA AUGMENTATION
Data augmentation is the easiest, and often a very effective way of enhancing the generalization
power of CNN models. Especially for cases where the number of training examples is relatively
low, data augmentation can enlarge the dataset (by factors of 16x, 32x, 64x, or even more) to
allow a more robust training of large-scale models.

Data augmentation is performed by making several copies of a single image using straightforward operations such as rotations, cropping, flipping, scaling, translations, and shearing (see Fig. 5.1). These operations can be performed separately or combined, e.g., to form copies which are both flipped and cropped.
Color jittering is another common way of performing data augmentation. A simple form
of this operation is to perform random contrast jittering in an image. One could also find the
principal color directions in the R, G, and B channels (using PCA) and then apply a random
offset along these directions to change the color values of the whole image. This effectively
introduces color and illumination invariance in the learned model [Krizhevsky et al., 2012].
Another approach for data augmentation is to utilize synthetic data, alongside the real
data, to improve the generalization ability of the network [Rahmani and Mian, 2016, Rahmani
et al., 2017, Shrivastava et al., 2016]. Since synthetic data is usually available in large quantities
from rendering engines, it effectively extends the training data, which helps avoid over-fitting.
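A minimal NumPy sketch of the geometric and photometric operations mentioned above is given below; the crop size and jitter range are arbitrary illustrative values:

import numpy as np

def augment(image, crop_size=224):
    # image: H x W x 3 uint8 array. Returns a few augmented copies.
    h, w, _ = image.shape
    copies = []

    # Random crop.
    top = np.random.randint(0, h - crop_size + 1)
    left = np.random.randint(0, w - crop_size + 1)
    copies.append(image[top:top + crop_size, left:left + crop_size])

    # Horizontal flip.
    copies.append(image[:, ::-1])

    # Simple contrast jitter around the mean intensity.
    alpha = np.random.uniform(0.8, 1.2)
    jittered = (image.astype(np.float32) - image.mean()) * alpha + image.mean()
    copies.append(np.clip(jittered, 0, 255).astype(np.uint8))

    return copies

image = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)   # stand-in image
print([c.shape for c in augment(image)])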
Figure 5.1: An example of data augmentation using crops (columns 1 and 2), rotations (column 3), and flips (column 4) of an input image showing a theater in Taiwan. Since the input image is quite complex (it contains several objects), data augmentation allows the network to figure out some possible variations of the same image which still denote the same scene category, i.e., a theater.

5.2.2 DROPOUT
One of the most popular approaches for neural network regularization is the dropout technique
[Srivastava et al., 2014]. During network training, each neuron is activated with a fixed probability (usually 0.5 or set using a validation set). This random sampling of a sub-network within
the full-scale network introduces an ensemble effect during the testing phase, where the full
network is used to perform prediction. Activation dropout works really well for regularization
purposes and gives a significant boost in performance on unseen data in the test phase.
Let us consider a CNN that is composed of L weight layers, indexed by l ∈ {1, ..., L}. Since dropout has predominantly been applied to Fully Connected (FC) layers in the literature, we consider the simpler case of FC layers here. Given the output activations a^{l-1} from the previous layer, an FC layer performs an affine transformation followed by an element-wise nonlinearity, as follows:

a^l = f(W a^{l-1} + b^l).    (5.3)

Here, a^{l-1} ∈ R^n and b^l ∈ R^m denote the activations and biases, respectively. The input and output dimensions of the FC layer are denoted by n and m, respectively. W ∈ R^{m×n} is the weight matrix and f(·) is the ReLU activation function.
The random dropout layer generates a mask m ∈ B^m, where each element m_i is independently sampled from a Bernoulli distribution with a probability p of being “on,” i.e., of a neuron firing:

m_i ∼ Bernoulli(p), \qquad m_i ∈ m.    (5.4)

This mask is used to modify the output activations a^l:

a^l = m ∘ f(W a^{l-1} + b^l),    (5.5)

where “∘” denotes the Hadamard product, i.e., a simple element-wise multiplication between the mask and the CNN activations.
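The following NumPy sketch implements the forward pass of Eqs. (5.3)–(5.5) for one FC layer. Note that it uses the common “inverted dropout” variant, which additionally divides by p so that the expected activation is unchanged at test time; this scaling is not part of Eq. (5.5) itself:

import numpy as np

def fc_dropout_forward(a_prev, W, b, p=0.5, train=True):
    # a_prev: (n,) activations from the previous layer; W: (m, n); b: (m,).
    a = np.maximum(0.0, W @ a_prev + b)              # ReLU(W a + b), Eq. (5.3)
    if train:
        mask = (np.random.rand(a.shape[0]) < p)      # m_i ~ Bernoulli(p), Eq. (5.4)
        a = a * mask / p                             # Hadamard product, Eq. (5.5), with inverted-dropout scaling
    return a

a_prev = np.random.randn(8)
W, b = np.random.randn(4, 8), np.zeros(4)
print(fc_dropout_forward(a_prev, W, b))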

5.2.3 DROP-CONNECT
Another approach similar to dropout is drop-connect [Wan et al., 2013], which randomly deactivates the network weights (or connections between neurons) instead of randomly setting the neuron activations to zero.
Similar to dropout, drop-connect performs a masking-out operation, but on the weight matrix instead of on the output activations:

a^l = f((M ∘ W) a^{l-1} + b^l),    (5.6)
M_{i,j} ∼ Bernoulli(p), \qquad M_{i,j} ∈ M,    (5.7)

where “∘” denotes the Hadamard product, as in the case of dropout.

5.2.4 BATCH NORMALIZATION
Batch normalization [Ioffe and Szegedy, 2015] normalizes the mean and variance of the output activations from a CNN layer to follow a unit Gaussian distribution. It proves to be very useful for the efficient training of a deep network because it reduces the “internal covariate shift” of the layer activations. Internal covariate shift refers to the change in the distribution of the activations of each layer as the parameters are updated during training. If the distribution which a hidden layer of a CNN is trying to model keeps changing (i.e., the internal covariate shift is high), the training process slows down and the network takes a long time to converge (simply because it is harder to reach a static target than a continuously shifting one). Normalizing this distribution leads to a consistent activation distribution during the training process, which enhances convergence and avoids network instability issues such as vanishing/exploding gradients and activation saturation.
Reflecting on what we have already studied in Chapter 4, this normalization step is similar
to the whitening transform (applied as an input pre-processing step) which enforces the inputs
to follow a unit Gaussian distribution with zero mean and unit variance. However, different to
the whitening transform, batch normalization is applied to the intermediate CNN activations
and can be integrated in an end-to-end network because of its differentiable computations.
The batch normalization operation can be implemented as a layer in a CNN. Given a set of activations {x^i : i ∈ [1, m]} (where x^i = {x^i_j : j ∈ [1, n]} has n dimensions) from a CNN layer corresponding to a specific input batch with m images, we can compute the first- and second-order statistics (mean and variance, respectively) of the batch for each dimension of the activations as follows:

μ_{x_j} = \frac{1}{m} \sum_{i=1}^{m} x^i_j,    (5.8)

σ^2_{x_j} = \frac{1}{m} \sum_{i=1}^{m} (x^i_j - μ_{x_j})^2,    (5.9)

where μ_{x_j} and σ^2_{x_j} represent the mean and variance of the j-th activation dimension computed over a batch, respectively. The normalization operation is then given by:

x̂^i_j = \frac{x^i_j - μ_{x_j}}{\sqrt{σ^2_{x_j} + ε}}.    (5.10)

Just normalizing the activations is not sufficient, because it can alter the activations and disrupt the useful patterns that are learned by the network. Therefore, the normalized activations are rescaled and shifted to allow the network to learn useful discriminative representations:

y^i_j = γ_j x̂^i_j + β_j,    (5.11)

where γ_j and β_j are learnable parameters which are tuned during error back-propagation.
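A minimal NumPy sketch of the batch normalization forward pass described by Eqs. (5.8)–(5.11) is given below (names are ours; a practical layer would additionally maintain running statistics for use at test time):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (m, n) activations of one batch; gamma, beta: (n,) learnable parameters.
    mu = x.mean(axis=0)                     # Eq. (5.8): per-dimension batch mean
    var = x.var(axis=0)                     # Eq. (5.9): per-dimension batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # Eq. (5.10): normalize
    return gamma * x_hat + beta             # Eq. (5.11): rescale and shift

x = np.random.randn(32, 4) * 3.0 + 1.0      # a batch with non-zero mean and large variance
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.var(axis=0).round(3))   # approximately 0 and 1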
Note that batch normalization is usually applied after the CNN weight layers, before applying the nonlinear activation function. Batch normalization is an important tool that is used in state-of-the-art CNN architectures (examples in Chapter 6). We briefly summarize the benefits of using batch normalization below.

• In practice, the network training becomes less sensitive to hyper-parameter choices (e.g.,
learning rate) when batch normalization is used [Ioffe and Szegedy, 2015].
• It stabilizes the training of very deep networks and provides robustness against bad weight
initializations. It also avoids the vanishing gradient problem and the saturation of activation functions (e.g., tanh and sigmoid).
• Batch normalization greatly improves the network convergence rate. This is very important because very deep network architectures can take several days (even with reasonable
hardware resources) to train on large-scale datasets.
• It integrates the normalization in the network by allowing back-propagation of errors
through the normalization layer, and therefore allows end-to-end training of deep networks.
• It makes the model less dependent on regularization techniques such as dropout. Therefore, recent architectures do not use dropout when batch normalization is extensively used
as a regularization mechanism [He et al., 2016a].

5.2.5 ENSEMBLE MODEL AVERAGING
The ensemble averaging approach is another simple, but effective, technique where a number of
models are learned instead of just a single model. Each model has different parameters due to
different random initializations, different hyper-parameter choices (e.g., architecture, learning
rate) and/or different sets of training inputs. The output from these multiple models is then
combined to generate a final prediction score. The prediction combination approach can be a
simple output averaging, a majority voting scheme or a weighted combination of all predictions.
The final prediction is more accurate and less prone to over-fitting compared to each individual
model in the ensemble. The committee of experts (ensemble) acts as an effective regularization
mechanism which enhances the generalization power of the overall system.
5.2.6 THE ℓ2 REGULARIZATION
The ℓ2 regularization penalizes large values of the parameters w during network training. This is achieved by adding a term containing the ℓ2 norm of the parameter values, weighted by a hyper-parameter λ which decides the strength of the penalization (in practice, half of the squared magnitude times λ is added to the error function, to ensure a simpler derivative term). Effectively,

this regularization encourages small and spread-out weight distributions rather than large values concentrated on only a few neurons. Consider a simple network with only a single hidden layer with parameters w and outputs p_n, n ∈ [1, N], where the output layer has N neurons. If the desired output is denoted by y_n, we can use a Euclidean objective function with ℓ2 regularization to update the parameters as follows:

w = \arg\min_{w} \sum_{m=1}^{M} \sum_{n=1}^{N} (p_n - y_n)^2 + λ \|w\|^2,    (5.12)

where M denotes the number of training examples. Note that, as we will discuss later, ℓ2 regularization performs the same operation as the weight decay technique. This approach is called “weight decay” because applying ℓ2 regularization decays each weight linearly toward zero (since the derivative of the regularizer term is λw for each weight).
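The equivalence with weight decay can be seen from a single SGD step on the ℓ2-regularized objective of Eq. (5.12); a minimal sketch (with illustrative names and values) is:

import numpy as np

def sgd_step_with_l2(w, grad_data, lr=0.1, lam=1e-4):
    # grad_data is the gradient of the data term (e.g., the Euclidean loss) w.r.t. w.
    # The gradient of (lam/2) * ||w||^2 is lam * w, so every weight is additionally
    # shrunk ("decayed") toward zero.
    return w - lr * (grad_data + lam * w)

w = np.array([0.5, -2.0, 1.0])
grad_data = np.array([0.1, 0.0, -0.2])
print(sgd_step_with_l2(w, grad_data))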

5.2.7 THE ℓ1 REGULARIZATION
The ℓ1 regularization technique is very similar to the ℓ2 regularization, the only difference being that the regularizer term uses the ℓ1 norm of the weights instead of the ℓ2 norm. A hyper-parameter λ is used to define the strength of the regularization. For a single-layered network with parameters w, we can denote the parameter optimization process using the ℓ1 norm as follows:

w = \arg\min_{w} \sum_{m=1}^{M} \sum_{n=1}^{N} (p_n - y_n)^2 + λ \|w\|_1,    (5.13)

where N and M denote the number of output neurons and the number of training examples, respectively. This effectively leads to sparse weight vectors for each neuron, with most of the incoming connections having very small weights.

5.2.8 ELASTIC NET REGULARIZATION
Elastic net regularization linearly combines the ℓ1 and ℓ2 regularization techniques by adding a term λ_1 |w| + λ_2 w^2 for each weight value. This results in sparse weights and often performs better than the individual ℓ1 and ℓ2 regularizations, each of which is a special case of elastic net regularization. For a single-layered network with parameters w, we can denote the parameter optimization process as follows:

w = \arg\min_{w} \sum_{m=1}^{M} \sum_{n=1}^{N} (p_n - y_n)^2 + λ_1 \|w\|_1 + λ_2 \|w\|^2,    (5.14)

where N and M denote the number of output neurons and the number of training examples, respectively.

5.2.9 MAX-NORM CONSTRAINTS
The max-norm constraint is a form of regularization which puts an upper bound on the norm of the incoming weights of each neuron in a neural network layer. As a result, the weight vector w must satisfy the constraint \|w\|_2 < h, where h is a hyper-parameter whose value is usually set based on the performance of the network on a validation set. The benefit of using such a regularization is that the network parameters are guaranteed to remain in a reasonable numerical range even when high learning rates are used during network training. In practice, this leads to better stability and performance [Srivastava et al., 2014].
5.2.10 EARLY STOPPING
The over-fitting problem occurs when a model performs very well on the training set but behaves poorly on unseen data. Early stopping is used to avoid over-fitting in iterative gradient-based algorithms. This is achieved by evaluating the performance on a held-out validation set at different iterations during the training process. The training algorithm continues to improve on the training set as long as the performance on the validation set also keeps improving. Once there is a drop in the generalization ability of the learned model, the learning process can be stopped or slowed down (Fig. 5.2).

[Plot: training error and validation error vs. training iterations; training is stopped at iteration t, where the validation error starts to rise.]
Figure 5.2: An illustration of the early stopping approach during network training.
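In code, early stopping amounts to monitoring the validation error and halting when it has not improved for a given number of iterations (the “patience”). A minimal sketch with illustrative names and a synthetic validation curve is:

def train_with_early_stopping(train_step, evaluate, max_iters=1000, patience=10):
    # train_step performs one training iteration; evaluate returns the validation error.
    best_error, best_iter = float('inf'), 0
    for t in range(max_iters):
        train_step()
        val_error = evaluate()
        if val_error < best_error:
            best_error, best_iter = val_error, t
        elif t - best_iter >= patience:
            print('Stopping at iteration', t, 'with best validation error', best_error)
            break
    return best_error

# Toy usage: a synthetic validation-error curve that starts rising after iteration 50.
errors = iter([1.0 / (t + 1) + 0.002 * max(0, t - 50) for t in range(1000)])
print(train_with_early_stopping(lambda: None, lambda: next(errors)))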
Having discussed the concepts which enable successful training of deep neural networks
(e.g., correct weight initialization and regularization techniques), we dive into the details of the
network learning process. Gradient-based algorithms are the most important tool to optimally
train such networks on large-scale datasets. In the following, we discuss different variants of
optimizers for CNNs.

5.3 GRADIENT-BASED CNN LEARNING

The CNN learning process tunes the parameters of the network such that the input space is
correctly mapped to the output space. As discussed before, at each training step, the current
estimate of the output variables is matched with the desired output (often termed the “groundtruth” or the “label space”). This matching function serves as an objective function during the
CNN training and it is usually called the loss function or the error function. In other words,
we can say that the CNN training process involves the optimization of its parameters such that
the loss function is minimized. The CNN parameters are the free/tunable weights in each of
its layers (e.g., filter weights and biases of the convolution layers) (Chapter 4).
An intuitive, but simple, way to approach this optimization problem is by repeatedly updating the parameters such that the loss function progressively reduces to a minimum value.
It is important to note here that the optimization of nonlinear models (such as CNNs) is a
hard task, exacerbated by the fact that these models are mostly composed of a large number of
tunable parameters. Therefore, instead of solving for a globally optimal solution, we iteratively
search for the locally optimal solution at each step. Here, the gradient-based methods come as
a natural choice, since we need to update the parameters in the direction of the steepest descent. The amount of parameter update, or the size of the update step, is called the “learning rate.” Each iteration which updates the parameters using the complete training set is called a “training epoch.” We can write each training iteration at time t using the following parameter update equations:

θ^t = θ^{t-1} - η δ^t,    (5.15)
s.t. \quad δ^t = \nabla_{θ} F(θ^t),    (5.16)

where F(·) denotes the function represented by the neural network with parameters θ, ∇ represents the gradient, and η denotes the learning rate.

5.3.1 BATCH GRADIENT DESCENT
As we discussed in the previous section, gradient descent algorithms work by computing the
gradient of the objective function with respect to the neural network parameters, followed by
a parameter update in the direction of the steepest descent. The basic version of the gradient
descent, termed “batch gradient descent,” computes this gradient on the entire training set. It
is guaranteed to converge to the global minimum for the case of convex problems. For nonconvex problems, it can still attain a local minimum. However, the training sets can be very
large in computer vision problems, and therefore learning via the batch gradient descent can
be prohibitively slow because for each parameter update, it needs to compute the gradient on
the complete training set. This leads us to the stochastic gradient descent, which effectively
circumvents this problem.

5.3.2 STOCHASTIC GRADIENT DESCENT
Stochastic Gradient Descent (SGD) performs a parameter update for each set of input and
output that are present in the training set. As a result, it converges much faster compared to
the batch gradient descent. Furthermore, it is able to learn in an “online manner,” where the
parameters can be tuned in the presence of new training examples. The only problem is that its
convergence behavior is usually unstable, especially for relatively larger learning rates and when
the training datasets contain diverse examples. When the learning rate is appropriately set, the
SGD generally achieves a similar convergence behavior, compared to the batch gradient descent,
for both the convex and non-convex problems.
5.3.3 MINI-BATCH GRADIENT DESCENT
Finally, the mini-batch gradient descent method is an improved form of the stochastic gradient
descent approach, which provides a decent trade-off between convergence efficiency and convergence stability by dividing the training set into a number of mini-batches, each consisting of
a relatively small number of training examples. The parameter update is then performed after
computing the gradients on each mini-batch. Note that the training examples are usually randomly shuffled to improve homogeneity of the training set. This ensures a better convergence
rate compared to the Batch Gradient Descent and a better stability compared to the Stochastic
Gradient Descent [Ruder, 2016].

5.4 NEURAL NETWORK OPTIMIZERS

After a general overview of the gradient descent algorithms in Section 5.3, we can note that
there are certain caveats which must be avoided during the network learning process. As an example, setting the learning rate can be a tricky endeavor in many practical problems. The training process is often highly affected by the parameter initialization. Furthermore, the vanishing and exploding gradient problems can occur, especially in deep networks. The training process is also susceptible to getting trapped in local minima, saddle points, or high-error plateaus where the gradient is approximately zero in every direction [Pascanu et al., 2014]. Note that saddle points (also called “minimax points”) are those stationary points on the surface of the function where the partial derivatives with respect to all dimensions become zero (Fig. 5.3). In the following discussion, we outline different methods to address these limitations of the gradient descent algorithms. Since our goal is to optimize over high-dimensional parameter spaces, we will restrict our discussion to the more feasible first-order methods and will not deal with higher-order methods (e.g., Newton's method), which are ill-suited for large datasets.

5.4.1 MOMENTUM
Momentum-based optimization provides an improved version of SGD with better convergence
properties. For example, SGD can oscillate close to a local minimum, resulting in unnecessarily delayed convergence.


Figure 5.3: A saddle point shown as a red dot on a 3D surface. Note that the gradient is effectively zero, but it corresponds to neither a minimum nor a maximum of the function.
The momentum method adds the gradient calculated at the previous time-step (a^{t-1}), weighted by a parameter γ, to the weight update equation as follows:

θ^t = θ^{t-1} - a^t,    (5.17)
a^t = η \nabla_{θ} F(θ^t) + γ a^{t-1},    (5.18)

where F(·) denotes the function represented by the neural network with parameters θ, ∇ represents the gradient, and η denotes the learning rate.
The momentum term has a physical interpretation. The dimensions whose gradients consistently point in the same direction are magnified quickly, while the dimensions whose gradients keep changing direction are suppressed. Essentially, the convergence speed is increased because unnecessary oscillations are avoided. This can be understood by the analogy of a ball rolling downhill: momentum keeps it moving along the direction of the maximum slope. Typically, the momentum γ is set to 0.9 during SGD-based learning.
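A minimal NumPy sketch of the update rule in Eqs. (5.17) and (5.18), applied to a toy quadratic objective, is given below (the function and parameter values are illustrative):

import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr=0.1, gamma=0.9):
    velocity = lr * grad + gamma * velocity   # a_t = eta * gradient + gamma * a_{t-1}, Eq. (5.18)
    theta = theta - velocity                  # theta_t = theta_{t-1} - a_t, Eq. (5.17)
    return theta, velocity

# Toy example: minimize F(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([5.0, -3.0])
velocity = np.zeros_like(theta)
for _ in range(100):
    theta, velocity = sgd_momentum_step(theta, grad=theta, velocity=velocity)
print(theta)   # approaches the minimum at the origin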

5.4.2 NESTEROV MOMENTUM
The momentum term introduced in the previous section can carry the ball beyond the minimum point. Ideally, we would like the ball to slow down when it reaches the minimum point and the slope starts ascending. This is achieved by the Nesterov momentum [Nesterov, 1983], which computes the gradient at the next approximate point during the parameter update process, instead of at the current point. This gives the algorithm the ability to “look ahead” at each iteration and plan a jump such that the learning process avoids uphill steps.

Figure 5.4: Comparison of the convergence behavior of the SGD with (right) and without (left)
momentum.
The update process can be represented as:

θ^t = θ^{t-1} - a^t,    (5.19)
a^t = η \nabla_{θ} F(θ^t - γ a^{t-1}) + γ a^{t-1},    (5.20)

where F(·) denotes the function represented by the neural network with parameters θ, ∇ represents the gradient operation, η denotes the learning rate, and γ is the momentum.
[Diagram: the momentum step combines the momentum term with the gradient term at the current position; the Nesterov step combines the momentum term with a “look ahead” gradient term evaluated at the anticipated next position.]

Figure 5.5: Comparison of the convergence behavior of SGD with momentum (left) and with the Nesterov update (right). While the momentum update can carry the solver quickly toward a local optimum, it can overshoot and miss it. A solver with the Nesterov update corrects its step by looking ahead and accounting for the gradient at the anticipated next position.

5.4.3 ADAPTIVE GRADIENT
The momentum in the SGD refines the update direction along the slope of the error function.
However, all parameters are updated at the same rate. In several cases, it is more useful to update
each parameter differently, depending on its frequency in the training set or its significance for
our end problem.
The Adaptive Gradient (AdaGrad) algorithm [Duchi et al., 2011] provides a solution to this problem by using an adaptive learning rate for each individual parameter θ_i. This is done at each time step t by dividing the learning rate of each parameter by the accumulated squares

of all the historical gradients for that parameter. This can be written as follows:

θ^t_i = θ^{t-1}_i - \frac{η}{\sqrt{\sum_{τ=1}^{t} (δ^{τ}_i)^2 + ε}} δ^t_i,    (5.21)

where δ^t_i is the gradient at time-step t with respect to the parameter θ_i, and ε is a very small term in the denominator to avoid division by zero. The adaptation of the learning rate for each parameter removes the need to manually tune its value. Typically, η is kept fixed to a single value (e.g., 10^{-2} or 10^{-3}) during the training phase. Note that AdaGrad works very well for sparse gradients, for which a reliable estimate of the past gradients is obtained by accumulating the contributions of all the previous time steps.

5.4.4 ADAPTIVE DELTA
Although AdaGrad eliminates the need to manually set the value of the learning rate at different epochs, it suffers from the vanishing learning rate problem. Specifically, as the number of iterations grows (t is large), the sum of the squared gradients becomes large, making the effective learning rate very small. As a result, the parameters barely change in the subsequent training iterations. Lastly, AdaGrad also needs an initial learning rate to be set during the training phase.
The Adaptive Delta (AdaDelta) algorithm [Zeiler, 2012] solves both these problems by accumulating only the last k gradients in the denominator term of Eq. (5.21). Therefore, the new update step can be represented as follows:

θ^t_i = θ^{t-1}_i - \frac{η}{\sqrt{\sum_{τ=t-k+1}^{t} (δ^{τ}_i)^2 + ε}} δ^t_i.    (5.22)

This requires the storage of the last k gradients at each iteration. In practice, it is much easier to work with a running average E[δ^2]_t, which can be defined as:

E[δ^2]_t = γ E[δ^2]_{t-1} + (1 - γ) δ_t^2.    (5.23)

Here, γ has a similar function to the momentum parameter. Note that the above recursion implements an exponentially decaying average of the squared gradients for each parameter. The new update step is:

θ^t_i = θ^{t-1}_i - \frac{η}{\sqrt{E[δ^2]_t + ε}} δ^t_i,    (5.24)
Δθ^t_i = -\frac{η}{\sqrt{E[δ^2]_t + ε}} δ^t_i.    (5.25)

Note that we still did not get rid of the initial learning rate η. Zeiler [2012] noted that this can be avoided by making the units of the update step consistent by introducing a Hessian

approximation in the update rule. This boils down to the following:

θ^t_i = θ^{t-1}_i - \frac{\sqrt{E[(Δθ)^2]_{t-1} + ε}}{\sqrt{E[δ^2]_t + ε}} δ^t_i.    (5.26)

Note that we have considered the local curvature of the function F to be approximately flat and replaced E[(Δθ)^2]_t (not known) with E[(Δθ)^2]_{t-1} (known).

5.4.5 RMSPROP
RMSprop [Tieleman and Hinton, 2012] is closely related to the AdaDelta approach, aiming to resolve the vanishing learning rate problem of AdaGrad. Similar to AdaDelta, it also calculates a running average as follows:

E[δ^2]_t = γ E[δ^2]_{t-1} + (1 - γ) δ_t^2.    (5.27)

Here, a typical value of γ is 0.9. The update rule of the tunable parameters takes the following form:

θ^t_i = θ^{t-1}_i - \frac{η}{\sqrt{E[δ^2]_t + ε}} δ^t_i.    (5.28)

5.4.6 ADAPTIVE MOMENT ESTIMATION
We stated that the AdaGrad solver suffers from the vanishing learning rate problem, but that it is very useful for cases where the gradients are sparse. On the other hand, RMSprop does not reduce the learning rate to a very small value at higher time steps; however, on the negative side, it does not provide an optimal solution for the case of sparse gradients. The ADAptive Moment estimation (Adam) approach [Kingma and Ba, 2014] estimates a separate learning rate for each parameter and combines the positives of both AdaGrad and RMSprop. The main difference between Adam and its two predecessors (RMSprop and AdaDelta) is that its updates are computed using both the first moment and the second moment of the gradient, rather than only the second moment as in Eqs. (5.26) and (5.28). Therefore, a running average of the gradients (mean) is maintained along with a running average of the squared gradients (variance) as follows:

E[δ]_t = β_1 E[δ]_{t-1} + (1 - β_1) δ_t,    (5.29)
E[δ^2]_t = β_2 E[δ^2]_{t-1} + (1 - β_2) δ_t^2,    (5.30)

where β_1 and β_2 are the decay parameters for the running averages of the mean and the variance, respectively. Since the initial moment estimates are set to zero, they can remain very small even after many iterations, especially when β_{1,2} ≠ 1.

[Plot: training cost (log scale) vs. iterations over the entire dataset for a multilayer neural network with dropout on MNIST, comparing AdaGrad, RMSProp, SGD with Nesterov momentum, AdaDelta, and Adam.]

Figure 5.6: Convergence performance on the MNIST dataset using different neural network
optimizers [Kingma and Ba, 2014]. (Figure used with permission.)
To overcome this issue, the initialization bias-corrected estimates of E[δ]_t and E[δ^2]_t are obtained as follows:

Ê[δ]_t = \frac{E[δ]_t}{1 - (β_1)^t},    (5.31)
Ê[δ^2]_t = \frac{E[δ^2]_t}{1 - (β_2)^t}.    (5.32)

Very similar to what we studied in the case of AdaGrad, AdaDelta, and RMSprop, the update rule for Adam is given by:

θ^t_i = θ^{t-1}_i - \frac{η}{\sqrt{Ê[δ^2]_t} + ε} Ê[δ]_t.    (5.33)

The authors found β_1 = 0.9, β_2 = 0.999, and η = 0.001 to be good default values for the decay (β) and learning (η) rates during the training process.
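The complete Adam update of Eqs. (5.29)–(5.33) can be written compactly in NumPy; the sketch below applies it to a toy quadratic objective (the names and the toy setup are ours):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the 1-based iteration index used for the bias correction.
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients, Eq. (5.29)
    v = beta2 * v + (1 - beta2) * grad ** 2     # running variance of gradients, Eq. (5.30)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected mean, Eq. (5.31)
    v_hat = v / (1 - beta2 ** t)                # bias-corrected variance, Eq. (5.32)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # Eq. (5.33)
    return theta, m, v

# Toy example: minimize F(theta) = 0.5 * ||theta||^2 (gradient is theta).
theta = np.array([5.0, -3.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, grad=theta, m=m, v=v, t=t, lr=0.01)
print(theta)   # settles close to the minimum at the origin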
Figure 5.6 [Kingma and Ba, 2014] illustrates the convergence performance of the discussed solvers on the MNIST dataset for handwritten digit classification. Note that the SGD

with Nesterov momentum shows good convergence behavior; however, it requires manual tuning of the learning-rate hyper-parameter. Among the solvers with an adaptive learning rate, Adam performs best in this example (also beating the manually tuned SGD-Nesterov solver). In practice, Adam usually scales very well to large-scale problems and exhibits nice convergence properties, which is why it is often the default choice for many deep learning based computer vision applications.

5.5 GRADIENT COMPUTATION IN CNNS

We have discussed a number of layers and architectures for CNNs. In Section 3.2.2, we also
described the back-propagation algorithm used to train CNNs. In essence, back-propagation
lies at the heart of CNN training. Error back-propagation can only happen if the CNN layers
implement a differentiable operation. Therefore, it is interesting to study how the gradient can
be computed for the different CNN layers. In this section, we will discuss in detail the different
approaches which are used to compute the differentials of popular CNN layers.
We describe in the following the four different approaches which can be used to compute
gradients.

5.5.1 ANALYTICAL DIFFERENTIATION
It involves the manual derivation of the derivatives of a function performed by a CNN layer.
These derivatives are then implemented in a computer program to calculate the gradients. The
gradient formulas are then used by an optimization algorithm (e.g., Stochastic Gradient Descent) to learn the optimal CNN weights.
Example: Assume that for a simple function, y = f(x) = x^2, we want to calculate the derivative analytically. By applying the differentiation formula for polynomial functions, we find the derivative as:

\frac{dy}{dx} = 2x,    (5.34)

which gives us the slope at any point x.
Analytically deriving the derivatives of complex expressions is time-consuming and laborious. Furthermore, it is necessary to model the layer operation as a closed-form mathematical expression. However, this approach provides an accurate value of the derivative at each point.

5.5.2 NUMERICAL DIFFERENTIATION
Numerical differentiation techniques use the values of a function to estimate the numerical value
of the derivative of the function at a specific point.

Example: For a given function f(x), we can estimate the first-order numerical derivative at a point x by using the function values at two nearby points, i.e., f(x) and f(x + h), where h is a small change in x:

\frac{f(x + h) - f(x)}{h}.    (5.35)

The above equation estimates the first-order derivative as the slope of the line joining the two points f(x) and f(x + h). This expression is called “Newton's difference formula.”
Numerical differentiation is useful in cases where we know little about the underlying function or when the actual function is too complex. Also, in several cases we only have access to discretely sampled data (e.g., at different time instances), and a natural choice is to estimate the derivatives without necessarily modeling the function and calculating the exact derivatives. Numerical differentiation is fairly easy to implement compared to the other approaches. However, it provides only an estimate of the derivative and works poorly, particularly for the calculation of higher-order derivatives.
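Newton's difference formula translates directly into code; the sketch below also shows the two-sided (central) difference, which is usually more accurate, and compares both against the exact derivative 2x of the earlier example (the helper names are ours):

def forward_difference(f, x, h=1e-5):
    # Newton's difference formula, Eq. (5.35).
    return (f(x + h) - f(x)) / h

def central_difference(f, x, h=1e-5):
    # Two-sided estimate of the first derivative.
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2
print(forward_difference(f, 3.0), central_difference(f, 3.0))   # both close to 2x = 6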

5.5.3 SYMBOLIC DIFFERENTIATION
Symbolic differentiation uses standard differential calculus formulas to manipulate mathematical expressions using computer algorithms. Popular software packages which perform symbolic differentiation include Mathematica, Maple, and Matlab.
Example: Suppose we are given a function f(x) = exp(sin(x)) and we need to calculate its 10th derivative with respect to x. An analytical solution would be cumbersome, while a numerical solution would be less accurate. In such cases, we can effectively use symbolic differentiation to get a reliable answer. The following code in Matlab (using the Symbolic Math Toolbox) gives the desired result.
>> syms x
>> f(x) = exp(sin(x))
>> diff(f,x,10)
256*exp(sin(x))*cos(x)^2 - exp(sin(x))*sin(x) - 5440*exp(sin(x))*cos(x)^4 + 2352*exp(sin(x))*cos(x)^6 - …
Symbolic differentiation is, in a sense, similar to analytical differentiation, but it leverages the power of computers to perform laborious derivations. This approach reduces the need to manually derive differentials and avoids the inaccuracies of numerical methods. However, symbolic differentiation often leads to complex and long expressions, which results in slow software

programs. Also, it does not scale well to higher-order derivatives (similar to numerical differentiation) due to the high complexity of the required computations. Furthermore, in neural network
optimization, we need to calculate partial derivatives with respect to a large number of inputs to
a layer. In such cases, symbolic differentiation is inefficient and does not scale well to large-scale
networks.

5.5.4 AUTOMATIC DIFFERENTIATION
Automatic differentiation is a powerful technique which uses both numerical and symbolic techniques to estimate the differential calculation in the software domain, i.e., given a coded computer program which implements a function, automatic differentiation can be used to design
another program which implements the derivative of that function. We illustrate the automatic
differentiation and its relationship with the numerical and symbolic differentiation in Fig. 5.7.
[Diagram: numerical differentiation operates on samples from a function, analytical/symbolic differentiation on the actual function, and automatic differentiation on the programmed function.]
Figure 5.7: Relationships between the different differentiation methods.
Every computer program is implemented using a programming language, which only
supports a set of basic functions (e.g., addition, multiplication, exponentiation, logarithm and
trigonometric functions). Automatic differentiation uses this modular nature of computer programs to break them into simpler elementary functions. The derivatives of these simple functions
are computed symbolically and the chain rule is then applied repeatedly to compute any order
of derivatives of complex programs.
Automatic differentiation provides an accurate and efficient solution to the differentiation
of complex expressions. Precisely, automatic differentiation gives results which are accurate to
the machine precision. The computational complexity of calculating derivatives of a function
is almost the same as evaluating the original function itself. Unlike symbolic differentiation,
it neither needs a closed-form expression of the function nor does it suffer from the expression swell which renders symbolic differentiation inefficient and difficult to code.

Algorithm 5.1 Forward mode of Automatic Differentiation
Input: x, C
Output: y_n
1: y_0 ← x                                % initialization
2: for all i ∈ [1, n] do
3:    y_i ← f^e_i(y_{Pa(f^e_i)})          % each function operates on its parents' output in the graph
4: end for

Algorithm 5.2 Backward mode of Automatic Differentiation
Input: x, C
Output: dy_n/dx
1: Perform forward mode propagation
2: for all i ∈ [n - 1, 0] do
3:    % chain rule to compute derivatives using the child nodes in the graph
4:    dy_n/dy_i ← \sum_{j ∈ Ch(f^e_i)} (dy_n/dy_j)(df^e_j/dy_i)
5: end for
6: dy_n/dx ← dy_n/dy_0

Current state-of-the-art CNN libraries, such as Theano and Tensorflow, use automatic differentiation to compute derivatives (see Chapter 8).
Automatic differentiation is very closely related to the back-propagation algorithm we
studied before in Section 3.2.2. It operates in two modes, the forward mode and the backward
mode. Given a complex function, we first decompose it into a computational graph consisting
of simple elementary functions which are joined with each other to compute the complex function. In the forward mode, given an input x, the computational graph C with n intermediate states (corresponding to the n elementary functions {f^e_i}) can be evaluated sequentially, as shown in Algorithm 5.1.
After the forward computations shown in Algorithm 5.1, the backward mode starts computing the derivatives from the final node and successively applies the chain rule to calculate the differential with respect to each intermediate output variable y_i, as shown in Algorithm 5.2.
A basic assumption in the automatic differentiation approach is that the expression is
differentiable. If this is not the case, automatic differentiation will fail. We provide a simple example of forward and backward modes of automatic differentiation below and refer the reader to
Baydin et al. [2015] for a detailed treatment of this subject in relation to machine/deep learning
techniques.

Example: Consider a slightly more complex function than the previous example for symbolic differentiation:

y = f(x) = \exp(\sin(x) + \sin(x)^2) + \sin(\exp(x) + \exp(x)^2).    (5.36)

We can represent its analytically or symbolically calculated differential as follows:

\frac{df}{dx} = \cos(\exp(2x) + \exp(x)) (2\exp(2x) + \exp(x)) + \exp(\sin(x)^2 + \sin(x)) (\cos(x) + 2\cos(x)\sin(x)).    (5.37)

But if we are interested in calculating its derivative using automatic differentiation, the first step is to represent the complete function in terms of basic operations (addition, exp, and sin), defined as:

a = \sin(x),    b = a^2,        c = a + b,
d = \exp(c),    e = \exp(x),    f = e^2,
g = e + f,      h = \sin(g),    y = d + h.    (5.38)

The flow of computations in terms of these basic operations is illustrated in the computational graph in Fig. 5.8. Given this computational graph, we can easily calculate the differential of the output with respect to each of the variables in the graph as follows:

\frac{dy}{dd} = 1,    \frac{dy}{dh} = 1,    \frac{dy}{dc} = \frac{dy}{dd}\frac{dd}{dc},    \frac{dy}{dg} = \frac{dy}{dh}\frac{dh}{dg},
\frac{dy}{db} = \frac{dy}{dc}\frac{dc}{db},    \frac{dy}{da} = \frac{dy}{dc}\frac{dc}{da} + \frac{dy}{db}\frac{db}{da},    \frac{dy}{df} = \frac{dy}{dg}\frac{dg}{df},
\frac{dy}{de} = \frac{dy}{dg}\frac{dg}{de} + \frac{dy}{df}\frac{df}{de},    \frac{dy}{dx} = \frac{dy}{da}\frac{da}{dx} + \frac{dy}{de}\frac{de}{dx}.    (5.39)

All of the above differentials can easily be computed because it is simple to compute the derivative of each basic function, e.g.,

\frac{dd}{dc} = \exp(c),    \frac{dh}{dg} = \cos(g),    \frac{dc}{db} = \frac{dc}{da} = 1,    \frac{df}{de} = 2e.    (5.40)

Note that we started at the end of the computational graph and computed all of the intermediate differentials moving backward, until we obtained the differential with respect to the input. The original differential expression we calculated in Eq. (5.37) was quite complex. However, once we decomposed the original expression into simpler functions in Eq. (5.38), we note that


the complexity of the operations that are required to calculate the derivative (backward pass) is almost the same as that of the calculation of the original expression (forward pass) according to the computational graph. Automatic differentiation uses the forward and backward operation modes to efficiently and precisely calculate the differentials of complex functions. As we discussed above, the calculation of differentials in this manner has a close resemblance to the back-propagation algorithm.
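To make the two passes concrete, the following Python sketch evaluates the decomposition of Eq. (5.38) in a forward sweep and then applies the chain rule of Eq. (5.39) in a backward sweep, checking the result against the analytical derivative of Eq. (5.37) and a numerical estimate (the code is a hand-rolled illustration; libraries such as Theano and Tensorflow automate exactly this bookkeeping):

import numpy as np

def f(x):
    # Eq. (5.36): y = exp(sin(x) + sin(x)^2) + sin(exp(x) + exp(x)^2)
    return np.exp(np.sin(x) + np.sin(x) ** 2) + np.sin(np.exp(x) + np.exp(x) ** 2)

def df_backward(x):
    # Forward pass through the elementary operations of Eq. (5.38).
    a = np.sin(x); b = a ** 2; c = a + b; d = np.exp(c)
    e = np.exp(x); f_ = e ** 2; g = e + f_; h = np.sin(g)
    # Backward pass: chain rule from the output toward the input, Eq. (5.39).
    dy_dd, dy_dh = 1.0, 1.0
    dy_dc = dy_dd * np.exp(c)               # dd/dc = exp(c)
    dy_dg = dy_dh * np.cos(g)               # dh/dg = cos(g)
    dy_db = dy_dc * 1.0                     # dc/db = 1
    dy_da = dy_dc * 1.0 + dy_db * 2 * a     # dc/da = 1, db/da = 2a
    dy_df = dy_dg * 1.0                     # dg/df = 1
    dy_de = dy_dg * 1.0 + dy_df * 2 * e     # dg/de = 1, df/de = 2e
    return dy_da * np.cos(x) + dy_de * np.exp(x)   # da/dx = cos(x), de/dx = exp(x)

x = 0.7
analytic = (np.cos(np.exp(2 * x) + np.exp(x)) * (2 * np.exp(2 * x) + np.exp(x))
            + np.exp(np.sin(x) ** 2 + np.sin(x)) * (np.cos(x) + 2 * np.cos(x) * np.sin(x)))
numeric = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6
print(df_backward(x), analytic, numeric)   # the three values agree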

Figure 5.8: Computational graph showing the calculation of our desired function.

5.6 UNDERSTANDING CNN THROUGH VISUALIZATION

Convolutional networks are large-scale models with a huge number of parameters that are
learned in a data driven fashion. Plotting an error curve and objective function on the training and validation sets against the training iterations is one way to track the overall training
progress. However, this approach does not give an insight into the actual parameters and activations of the CNN layers. It is often useful to visualize what CNNs have learned during or after
the completion of the training process. We outline some basic approaches to visualize CNN features and activations below. These approaches can be categorized into three types depending on the network signal that is used to obtain the visualization, i.e., weights, activations, and gradients. We summarize methods of each type below.

5.6.1 VISUALIZING LEARNED WEIGHTS
One of the simplest approaches to visualize what a CNN has learned is to look at its convolution filters. For example, 9 × 9 and 5 × 5 convolution kernels that are learned on a labeled shadow dataset are illustrated in Fig. 5.9. These filters correspond to the first and second convolutional layers in a LeNet-style CNN (see Chapter 6 for details on CNN architectures).

Figure 5.9: Examples of 9 × 9 (left) and 5 × 5 (right) convolution kernels learned for the shadow detection task. The filters illustrate the type of patterns that a particular CNN layer is looking for in the input data. (Figure adapted from Khan et al. [2014].)

5.6.2 VISUALIZING ACTIVATIONS
The feature activations from the intermediate CNN layers can also provide useful clues about
the quality of learned representations.
A trivial way to visualize learned features is to plot the output feature activations corresponding to an input image. As an example, we show the output convolutional activations
(or features) corresponding to sample digits 2, 5, and 0 belonging to the MNIST dataset in
Fig. 5.10. Specifically, these features are the output of the first convolution layer in the LeNet
architecture (see Chapter 6 for details on this architecture type).

Figure 5.10: Intermediate feature representations from a CNN corresponding to example input
images of handwritten digits from the MNIST dataset. Such a visualization of output activations
can provide an insight about the patterns in input images that are extracted as useful features for
classification.
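As a rough illustration of how such activation maps can be obtained in practice, the sketch below registers a forward hook on the first convolution layer of a small PyTorch model and plots one activation map per filter. The PyTorch/matplotlib setup, the layer sizes, and the random input are illustrative assumptions; in practice the hook would be attached to a trained LeNet-style network and fed a real MNIST digit.

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# A stand-in first convolution layer; in practice this would be the
# first layer of a trained LeNet-style model.
model = nn.Sequential(nn.Conv2d(1, 6, kernel_size=5), nn.Tanh())

activations = {}
def save_activation(module, inputs, output):
    # Store the output feature maps of the hooked layer.
    activations["conv1"] = output.detach()

model[0].register_forward_hook(save_activation)

image = torch.randn(1, 1, 32, 32)   # a dummy MNIST-sized input
model(image)

feats = activations["conv1"][0]     # shape: (6, 28, 28)
fig, axes = plt.subplots(1, feats.shape[0], figsize=(12, 2))
for i, ax in enumerate(axes):
    ax.imshow(feats[i].numpy(), cmap="gray")  # one activation map per filter
    ax.axis("off")
plt.show()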
Another approach is to obtain the feature representation from the penultimate layer of a
trained CNN and visualize all the training images in a dataset as a 2D plot (e.g., using tSNE low
dimensional embedding). The tSNE embedding approximately preserves the pairwise distances between the features in the original high-dimensional space. An example of such a visualization is shown in
Fig. 5.11. This visualization can provide a holistic view and suggest the quality of learned feature
representations for different classes. For instance, classes that are clustered tightly together in
the feature space will be classified more accurately than those that have a widespread overlap with other classes, making it difficult to accurately model the classification boundaries.

Figure 5.11: tSNE visualization of the final fully connected layer features corresponding to images from the MIT-67 indoor scene dataset. Each color represents a different class in the dataset. Note that the features belonging to the same class are clustered together. (Figure adapted from Khan et al. [2017b].)

Alongside visualizing the 2D embedding of the high-dimensional feature vectors, we can also
visualize the input images associated with each feature vector (see Fig. 5.13). In this manner,
we can observe how visually similar images are clustered together in the tSNE embedding of
high-dimensional features. As an example, cellar images are clustered together on the top left
in the illustration shown in Fig. 5.13.
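A minimal sketch of this kind of visualization, assuming scikit-learn and matplotlib are available, is given below; the random features and labels are placeholders for the penultimate-layer CNN features and class labels of the training images.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# 'features' would normally be the penultimate-layer CNN features of all
# training images (N x D) and 'labels' their class indices; random data
# is used here only as a placeholder.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 512))
labels = rng.integers(0, 10, size=500)

# Project the high-dimensional features to 2D while approximately
# preserving pairwise distances.
embedding = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=8)
plt.title("tSNE embedding of penultimate-layer features")
plt.show()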
An attention map can also provide an insight into the regions which were given more importance while making a classification decision. In other words, we can visualize the regions in an image which contributed most to the correct prediction of a category. One way to achieve this is to obtain prediction scores for individual overlapping patches within an image and plot the final prediction score for the correct class. Examples of such a visualization are shown in Fig. 5.12. We can note that distinctive portions of the scenes played a crucial role in the correct prediction of the category for each image, e.g., a screen in an auditorium, a train in a train station, and a bed in a bedroom.
Other interesting approaches have also been used in the literature to visualize the image
parts which contribute most to the correct prediction. Zeiler and Fergus [2014] systematically
occluded a square patch within an input image and plotted a heat map indicating the change in
the prediction probability of the correct class. The resulting heat map indicates which regions in
an input image are most important for a correct output response from the network (see Fig. 5.14a
for examples).

Figure 5.12: The contributions of distinctive patches within an image toward the prediction of the correct scene class are shown in the form of a heat map ("red" denotes a higher contribution). (Figure adapted from Khan et al. [2016b].)

Bolei et al. [2015] first segment an input image into regions; these segments are then iteratively dropped such that the correct class prediction is least affected. This process
continues until an image with minimal scene details is left. These details are sufficient for the
correct classification of the input image (see Fig. 5.14b for examples).
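The occlusion-based analysis of Zeiler and Fergus [2014] can be approximated with a few lines of code. The sketch below assumes a PyTorch classification model that outputs class logits; it slides a gray patch over the input and records how much the probability of the target class drops at each position (the patch size, stride, and fill value are illustrative choices).

import torch

def occlusion_map(model, image, target_class, patch=32, stride=16, fill=0.5):
    """Slide a gray patch over the image and record the drop in the
    predicted probability of the target class at each position."""
    model.eval()
    _, _, H, W = image.shape
    heatmap = torch.zeros((H - patch) // stride + 1, (W - patch) // stride + 1)
    with torch.no_grad():
        base = torch.softmax(model(image), dim=1)[0, target_class]
        for i, top in enumerate(range(0, H - patch + 1, stride)):
            for j, left in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.clone()
                occluded[:, :, top:top + patch, left:left + patch] = fill
                prob = torch.softmax(model(occluded), dim=1)[0, target_class]
                heatmap[i, j] = base - prob   # a large drop marks an important region
    return heatmap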

5.6.3 VISUALIZATIONS BASED ON GRADIENTS
Balduzzi et al. [2017] presented the idea that visualizing distributions of gradients can provide
useful insight into the convergence behavior of deep neural networks. Their analysis showed that
naively increasing the depth of a neural network results in the gradient shattering problem (i.e.,
the gradients show a distribution similar to white noise). They demonstrated that the gradient
distribution resembled brown noise when batch normalization and skip connections were used
in a very deep network (see Fig. 5.15).
Back-propagated gradients within a CNN have been used to identify specific patterns in
the input image that maximally activate a particular neuron in a CNN layer. In other words,
gradients can be adjusted to generate visualizations that illustrate the patterns that a neuron has
learned to look for in the input data. Zeiler and Fergus [2014] pioneered this idea and introduced
a deconvolution-based approach to invert the feature representations and identify the associated patterns in the input images. Yosinski et al. [2015] generated synthetic images by first selecting the i-th neuron in a CNN layer.

Figure 5.13: Visualization of images based on the tSNE embedding of the convolutional features
from a deep network. The images belong to the MIT-67 dataset. Note that the images belonging
to the same class are clustered together, e.g., all cellar images can be found in the top-left corner.
(Figure adapted from Hayat et al. [2016].)

Figure 5.14: Visualization of regions which are important for the correct prediction from a deep network. (a) Gray regions in the input images (true labels: Pomeranian, Car Wheel, and Afghan Hound) are sequentially occluded and the output probability of the correct class is plotted as a heat map (blue regions indicate high importance for correct classification). (b) Segmented regions in an image are occluded until only the minimal image details that are required for the correct scene class prediction are left. As we expect, a bed is found to be the most important aspect of a bedroom scene. (Figures adapted from Bolei et al. [2015] and Zeiler and Fergus [2014], used with permission.)

Afterward, an input image with random color values is passed through the network and the corresponding activation value for the i-th neuron is calculated. The gradient of this activation with respect to the input image is calculated via error back-propagation. This gradient essentially denotes how the pixels can be changed such that the neuron gets maximally activated. The input is iteratively modified using this information so that we obtain an image which results in a high activation for neuron i. It is usually useful to impose a "style constraint" while modifying the input image, which acts as a regularizer and encourages the input image to remain similar to the training data. Some example images obtained in this manner are shown in
Fig. 5.16.
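A simplified version of this gradient-ascent procedure is sketched below, assuming a PyTorch model and a chosen layer/unit. The "style constraint" mentioned above is approximated here by a simple weight-decay term on the input, which is only a crude stand-in for the regularizers used by Yosinski et al. [2015].

import torch

def maximize_neuron(model, layer, unit, steps=200, lr=0.1, weight_decay=1e-4):
    """Gradient ascent on a random input so that a chosen unit of a chosen
    layer is maximally activated; weight decay acts as a crude regularizer."""
    acts = {}
    handle = layer.register_forward_hook(lambda m, i, o: acts.update(out=o))
    x = torch.rand(1, 3, 224, 224, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr, weight_decay=weight_decay)
    model.eval()
    for _ in range(steps):
        optimizer.zero_grad()
        model(x)
        # Maximize the mean activation of the chosen unit / feature map.
        loss = -acts["out"][0, unit].mean()
        loss.backward()
        optimizer.step()
    handle.remove()
    return x.detach()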

Figure 5.15: The top row shows the gradient distributions for a range of uniformly sampled inputs, and the bottom row shows the corresponding covariance matrices, for (a) a 1-layer feedforward network, (b) a 24-layer feedforward network, and (c) a 50-layer ResNet, alongside reference plots for (d) brown noise and (e) white noise. The gradients of the 24-layer network resemble white noise, while the gradients of the high-performing ResNet resemble brown noise. Note that the ResNet uses skip connections and batch normalization, which result in better convergence even with much deeper network architectures. The ResNet architecture will be discussed in detail in Chapter 6. (Figure from Balduzzi et al. [2017], used with permission.)

Figure 5.16: Synthetic input images that maximally activate the output neurons corresponding to different classes (Rocking Chair, Teddy Bear, Windsor Tie, and Pitcher). As evident, the network looks for objects with similar low-level and high-level cues, e.g., edge information and holistic shape information, respectively, in order to maximally activate the relevant neuron. (Figure from Yosinski et al. [2015], used with permission.)

Mahendran and Vedaldi [2015] propose an approach to recover input images that correspond to identical high-dimensional feature representations in the second-to-last layer of a CNN. Given an image x, a learned CNN F, and a high-dimensional representation f, the following loss is minimized through gradient descent optimization to find images in the input domain that correspond to the same feature representation f:

$$x^{*} = \underset{x}{\operatorname{argmin}} \; \| F(x) - f \|^{2} + R(x).$$

Here, R denotes a regularizer which encourages the generated images to look natural and avoids noisy and spiky patterns that are not helpful in perceptual visualization. Figure 5.17 illustrates examples of reconstructed images obtained from the CNN features at different layers of the network. It is interesting to note that the objects vary in position, scale, and deformation, but the holistic characteristics remain the same. This indicates that the network has learned meaningful characteristics of the input image data.
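The sketch below illustrates this feature-inversion objective in PyTorch, assuming a feature extractor F and a target representation f are given. A total-variation penalty is used as one possible choice for the regularizer R(x); the exact regularizer of Mahendran and Vedaldi [2015] is more elaborate.

import torch

def invert_features(feature_extractor, target_f, steps=300, lr=0.05, tv_weight=1e-3):
    """Find an input image whose CNN representation matches 'target_f',
    i.e., minimize ||F(x) - f||^2 + R(x) with a total-variation regularizer."""
    x = torch.rand(1, 3, 224, 224, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        fx = feature_extractor(x)
        match = (fx - target_f).pow(2).sum()
        # Total variation: penalize noisy, spiky patterns (the R(x) term).
        tv = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs().sum() + \
             (x[:, :, :, 1:] - x[:, :, :, :-1]).abs().sum()
        loss = match + tv_weight * tv
        loss.backward()
        optimizer.step()
    return x.detach()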

Figure 5.17: Multiple inverted images obtained from the high-dimensional feature representations at different CNN layers (pool5, relu6, relu7, and fc8). Note that the features from the lower layers retain local information better than the higher ones. (Figure from Mahendran and Vedaldi [2015], used with permission.)
Up to this point, we have completed the discussion of the CNN architecture, its training process, and the visualization approaches used to understand the workings of CNNs. Next, we will describe a number of successful CNN examples from the literature, which will help us develop an insight into the state-of-the-art network topologies and their relative pros and cons.

CHAPTER 6

Examples of CNN Architectures
In the previous chapters, we covered the basic modules that can be joined together to develop CNN-based deep learning models. Among these modules, we covered convolution, sub-sampling, and several other layers which form large-scale CNN architectures. We saw that loss functions are used during training to measure the difference between the predicted and desired outputs of the model. We discussed modules which are used to regularize the networks and optimize their performance and convergence speed. We also covered several gradient-based learning algorithms for successful CNN training, along with different tricks to achieve stable training of CNNs, such as weight initialization strategies. In this chapter, we introduce several successful CNN designs which are constructed using the basic building blocks that we studied in the previous chapters. Among these, we present both early architectures (which have traditionally been popular in computer vision and are relatively easy to understand) and the most recent CNN models (which are relatively complex and built on top of the conventional designs). There is a natural order in these architectures, according to which their designs have evolved over recent years. Therefore, we elaborate on each of these designs while emphasizing their connections and the trend along which the designs have progressed. In the following, we begin with a simple CNN architecture, known as LeNet.

6.1 LENET

The LeNet [LeCun et al., 1998] architecture is one of the earliest and most basic forms of CNNs, and was applied to handwritten digit recognition. A successful variant of this architecture style is called the LeNet-5 model, because it comprises five weight layers in total. Specifically, it consists of two convolution layers, each followed by a sub-sampling (max-pooling) layer, to extract features. Afterward, a single convolution layer is followed by a set of two fully connected layers toward the end of the model, which act as a classifier on the extracted features. Note that the activations after the weight layers are also squashed using a tanh nonlinearity. The model architecture used to train on the MNIST digit dataset is shown in Fig. 6.1.

Figure 6.1: The LeNet-5 architecture: an input image (32×32), a 5×5 convolution layer (6 filters), 2×2 max-pooling, a 5×5 convolution layer (16 filters), 2×2 max-pooling, a 4×4 convolution layer (120 filters), an FC layer (64), and an FC layer (10).
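A minimal PyTorch sketch of this model, assuming the layer sizes listed in Fig. 6.1 (note that the classical description in LeCun et al. [1998] uses slightly different layer sizes), is given below.

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """LeNet-style model following the layer sizes shown in Fig. 6.1."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),    # 32x32 -> 28x28x6
            nn.MaxPool2d(2),                              # -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),   # -> 10x10x16
            nn.MaxPool2d(2),                              # -> 5x5x16
            nn.Conv2d(16, 120, kernel_size=4), nn.Tanh(), # -> 2x2x120
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120 * 2 * 2, 64), nn.Tanh(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = LeNet5()(torch.randn(1, 1, 32, 32))   # -> shape (1, 10)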

6.2 ALEXNET

AlexNet [Krizhevsky et al., 2012] was the first large-scale CNN model which led to the resurgence of deep neural networks in computer vision (other CNN architectures before AlexNet,
e.g., [Cireşan et al., 2011, LeCun et al., 1998] were relatively smaller and were not tested on
large-scale datasets such as the ImageNet dataset). This architecture won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012 by a large margin.
The main differences between the AlexNet architecture and its predecessors are the increased network depth, which leads to a significantly larger number of tunable parameters, and the regularization tricks used (such as activation dropout [Srivastava et al., 2014] and data augmentation). It consists of a total of eight parameter layers, among which the first five are convolutional layers, while the last three are fully connected layers. The final fully connected layer (i.e., the output layer) classifies an input image into one of the thousand classes of
the ImageNet dataset, and therefore contains 1,000 units. The filter sizes and the location of
the max-pooling layers are shown in Fig. 6.2. Note that dropout is applied after the first two
fully connected layers in the AlexNet architecture, which leads to a reduced over-fitting and a
better generalization to unseen examples. Another distinguishing aspect of AlexNet is the usage
of ReLU nonlinearity after every convolutional and fully connected layer, which substantially
improves the training efficiency compared to the traditionally used tanh function.
Although AlexNet is much smaller (in terms of the number of layers) compared to the recent state-of-the-art CNN architectures, it is interesting to note that Krizhevsky et al. [2012] had to split the training between two GPUs at the time of its first implementation. This was necessary because a single NVIDIA GTX 580 (with 3 GB of memory) could not hold the complete network, which consists of around 62 million parameters. It took around six days to train the network on the complete ImageNet dataset. Note that the ImageNet training set contains 1.2 million images belonging to a thousand different object classes.
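Assuming a recent version of the torchvision package is available, the (single-GPU) AlexNet variant it ships can be instantiated to check this parameter count; the total comes out at roughly 61–62 million parameters, in line with the figure quoted above.

from torchvision.models import alexnet

# Build AlexNet without pre-trained weights and count its trainable parameters.
model = alexnet(weights=None)
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"AlexNet parameters: {num_params / 1e6:.1f} M")   # roughly 61-62 M
print(model.classifier)   # the dropout + fully connected head described above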

Figure 6.2: The AlexNet architecture: an input image (224×224×3), followed by convolution layers with 11×11 (96 filters, stride 4), 5×5 (256), 3×3 (384), 3×3 (384), and 3×3 (256) kernels, and three fully connected layers with 4096, 4096, and 1000 units, with max-pooling and dropout layers interspersed as described in the text.

6.3 NETWORK IN NETWORK

The Network in Network (NiN) [Lin et al., 2013] architecture is a simple and lightweight CNN model which generally performs very well on small-scale datasets. It introduces two new ideas in the CNN design. First, it shows that incorporating fully connected layers in-between the convolutional layers is helpful in network training. Accordingly, the example architecture consists of three convolutional layers at the first, fourth, and seventh locations (among the weight layers) with filters of size 5 × 5, 5 × 5, and 3 × 3, respectively. Each of these convolutional layers is followed by a pair of fully connected layers (or, equivalently, convolutional layers with 1 × 1 filter sizes) and a max-pooling layer. Second, this architecture utilizes global average pooling at the end of the model as a regularizer. This pooling scheme simply combines all activations within each feature map (by averaging) to obtain a single score per feature map; these scores are forwarded to a soft-max loss layer. Note that a drop-out regularization block after the first two max-pooling layers also helps in achieving a lower test error on a given dataset.
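The sketch below shows one way to express the NiN idea in PyTorch: each convolution is followed by a pair of 1 × 1 convolutions (the "fully connected layers between convolutions"), and global average pooling replaces the usual fully connected classifier. The channel widths, input size, and dropout rate are illustrative assumptions rather than values taken from the original paper.

import torch
import torch.nn as nn

def mlpconv(in_ch, out_ch, kernel, **kw):
    # A convolution followed by two 1x1 convolutions, i.e., the "fully
    # connected layers in-between the convolutional layers" of NiN.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, **kw), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True),
    )

num_classes = 10
net = nn.Sequential(
    mlpconv(3, 96, 5, padding=2), nn.MaxPool2d(3, stride=2), nn.Dropout(0.5),
    mlpconv(96, 192, 5, padding=2), nn.MaxPool2d(3, stride=2), nn.Dropout(0.5),
    mlpconv(192, num_classes, 3, padding=1),
    nn.AdaptiveAvgPool2d(1),   # global average pooling: one value per feature map
    nn.Flatten(),              # -> (batch, num_classes), fed to a soft-max loss
)
scores = net(torch.randn(2, 3, 32, 32))   # -> shape (2, 10)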

Figure 6.3: The NiN architecture: convolution layers with 5×5, 5×5, and 3×3 filters, each followed by fully connected (1×1 convolution) layers, with pooling (max or average) and dropout layers in between.

6.4 VGGNET

The VGGnet architecture [Simonyan and Zisserman, 2014b] has been one of the most popular CNN models since its introduction in 2014, even though it was not the winner of ILSVRC'14. The reason for its popularity lies in its model simplicity and the use of small-sized convolution kernels, which leads to very deep networks. The authors introduced a set of network configurations, among which configurations D and E (commonly referred to as VGGnet-16 and VGGnet-19 in the literature) are the most successful ones.
The VGGnet architecture strictly uses 3 × 3 convolution kernels with intermediate max-pooling layers for feature extraction, and a set of three fully connected layers toward the end for classification. Each convolution layer is followed by a ReLU layer in the VGGnet architecture. The design choice of using smaller kernels leads to a relatively reduced number of parameters, and therefore efficient training and testing. Moreover, by stacking a series of 3 × 3 kernels, the effective receptive field can be increased to larger values (e.g., 5 × 5 with two layers, 7 × 7 with three layers, and so on). Most importantly, with smaller filters one can stack more layers, resulting in deeper networks, which leads to a better performance on vision tasks. This essentially conveys the central idea of this architecture, which supports the usage of deeper networks for improved feature learning. Figure 6.4 shows the best-performing model, VGGnet-16 (configuration D), which has 138 million parameters. Similar to AlexNet, it also uses activation dropout in the first two fully connected layers to avoid over-fitting.
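A small worked example supports the parameter argument above: for C input and C output channels, a single 7 × 7 convolution uses 49C² weights, whereas a stack of three 3 × 3 convolutions covering the same 7 × 7 effective receptive field uses only 27C² weights (the choice C = 512 below is illustrative).

# Parameter count (ignoring biases) for C input and C output channels:
# one 7x7 convolution vs. a stack of three 3x3 convolutions covering the
# same 7x7 effective receptive field.
C = 512
params_7x7 = 7 * 7 * C * C               # single layer: 49 * C^2
params_3x3_stack = 3 * (3 * 3 * C * C)   # three layers: 27 * C^2
print(params_7x7, params_3x3_stack, params_3x3_stack / params_7x7)
# 12845056 7077888 0.55...  -> roughly 45% fewer parameters, plus two extra ReLUs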

Figure 6.4: The VGGnet-16 architecture (configuration D): an input image (224×224×3), followed by thirteen 3×3 convolution layers with 64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, and 512 filters (with intermediate max-pooling layers), and three fully connected layers with 4096, 4096, and 1000 units (with dropout after the first two).

6.5 GOOGLENET

All the previously discussed networks consist of a sequential architecture with only a single path. Along this path, different types of layers, such as convolution, pooling, ReLU, dropout, and fully connected layers, are stacked on top of each other to create an architecture of the desired depth. The GoogleNet [Szegedy et al., 2015] architecture is the first popular model which uses a more complex architecture with several network branches. This model won the ILSVRC'14 competition with the best top-5 error rate of 6.7% on the classification task. Afterward, several improved and extended versions of GoogleNet have also been proposed. However, we will restrict this discussion to the ILSVRC'14 submission of GoogleNet.
Figure 6.5: The Inception module. (a) Basic Inception module concept: the output from the previous layer is processed in parallel by 1×1, 3×3, and 5×5 convolution layers and a 3×3 max-pooling layer, and the resulting features are concatenated along the depth dimension. (b) Inception module with dimension reductions: additional 1×1 convolution layers are placed before the 3×3 and 5×5 convolutions and after the max-pooling layer to reduce the feature dimensionality.
GoogleNet consists of a total of 22 weight layers. The basic building block of the network is the "Inception module," due to which the architecture is also commonly called the "Inception Network." The processing in this module happens in parallel, which is in contrast to the sequential processing of the previously discussed architectures. A simple (naive) version of this module is shown in Fig. 6.5a. The central idea here is to place all the basic processing blocks (which occur in a regular sequential convolutional network) in parallel and combine their output feature representations. The advantage of this design is that multiple Inception modules can be stacked together to create a giant network, without the need to worry about the design of each individual layer at different stages of the network. However, as you might have noticed, the problem is that if we concatenate all the individual feature representations from each individual block along the depth dimension, the result is a very high-dimensional feature output. To overcome this problem, the full Inception module performs dimensionality reduction before passing the input feature volume (say with dimensions h × w × d) through the 3 × 3 and 5 × 5 convolution filters. This dimensionality reduction is performed using a fully connected layer, which is equivalent to a 1 × 1 convolution operation. As an example, if the 1 × 1 convolution layer has d′ filters, such that d′ < d, the output from this layer will have a smaller dimension of h × w × d′. We saw a similar fully connected layer before the convolution layers in the NiN architecture discussed earlier in Section 6.3. In both cases, such a layer leads to a better performance. You may be wondering why a fully connected layer is useful when used before the convolutional layers. The answer is that while the convolution filters operate in the spatial domain (i.e., along the height and width of the input feature channels), a fully connected layer can combine information from multiple feature channels (i.e., along the depth dimension). Such a

flexible combination of information leads to not only reduced feature dimensions, but also to an
enhanced performance of the inception module.
If you closely look at the inception module (Fig. 6.5), you can understand the intuition
behind the combination of a set of different operations into a single block. The advantage is
that the features are extracted using a range of filter sizes (e.g., 1 × 1, 3 × 3, and 5 × 5), which correspond to different receptive fields and encode features at multiple levels from the
input. Similarly, there is a max-pooling layer which down-samples the input to obtain a feature
representation. Since all the convolution layers in the GoogleNet architecture are followed by a
ReLU nonlinearity, this further enhances the capability of the network to model nonlinear relationships. In the end, these different complementary features are combined together to obtain
a more useful feature representation.
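The sketch below expresses the Inception module with dimension reductions (Fig. 6.5b) in PyTorch, using the branch widths that the caption of Fig. 6.6 quotes for Inception 3a; it is a simplified illustration rather than the exact GoogleNet implementation.

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception module with dimension reductions (Fig. 6.5b)."""
    def __init__(self, in_ch, b1, b3_red, b3, b5_red, b5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, b1, 1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, b3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(b3_red, b3, 3, padding=1), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, b5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(b5_red, b5, 5, padding=2), nn.ReLU(inplace=True))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Run the four branches in parallel and concatenate along depth.
        outs = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat(outs, dim=1)

# Inception 3a: 64 + 128 + 32 + 32 = 256 output channels.
block = InceptionModule(192, b1=64, b3_red=96, b3=128, b5_red=16, b5=32, pool_proj=32)
out = block(torch.randn(1, 192, 28, 28))   # -> (1, 256, 28, 28)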
In the GoogleNet architecture shown in Fig. 6.6, 9 inception modules are stacked together,
which results in a 22 layer deep network (the total number of layers in the network is > 100).
Similar to NiN, GoogleNet uses a global average pooling followed by a fully connected layer
toward the end of the network for classification. The global average pooling layer provides faster
computations with a better classification accuracy and a much reduced number of parameters.
Another intuitive characteristic of the GoogleNet design is the availability of several output
branches in the intermediate layers (e.g., after 4a and 4d), where classifiers are trained on the
end task. This design feature alleviates the vanishing gradient problem by passing strong gradient signals back to the initial layers through these auxiliary classification branches.
GoogleNet also uses dropout before the final fully connected layers in each output branch for
regularization.
Although the GoogleNet architecture looks much more complex than its predecessors,
e.g., AlexNet and VGGnet, it involves a significantly reduced number of parameters (6 million
compared to 62 million in AlexNet and 138 million parameters in VGGnet). With a much
smaller memory footprint, a better efficiency and a high accuracy, GoogleNet is one of the most
intuitive CNN architectures which clearly demonstrates the importance of good design choices.

6.6 RESNET

The Residual Network [He et al., 2016a] from Microsoft won the ILSVRC 2015 challenge
with a big leap in performance, reducing the top-5 error rate to 3.6% from the previous year’s
winning performance of 6.7% by GoogleNet [Szegedy et al., 2015]. It is worth mentioning here
that the ILSVRC 2016 winners obtained an error rate of 3.0% using an ensemble of previous
popular models such as the GoogleNet Inception and the Residual Network, along with their
variants (e.g., Wide Residual Networks and the Inception-ResNets) [ILS].
The remarkable feature of the residual network architecture is the identity skip connections in the residual blocks, which allow very deep CNN architectures to be trained easily. To understand
these connections, consider the residual block in Fig. 6.7. Given an input x, the CNN weight layers implement a transformation of this input, denoted by F(x).

Figure 6.6: The GoogleNet architecture: a 7×7 (64 filters) convolution with stride 2 and padding 3, a 3×3 max-pooling with stride 2, a 3×3 (192) convolution with padding 1, another 3×3 max-pooling with stride 2, the stacked Inception modules 3a–3b, 4a–4e, and 5a–5b (with 3×3 max-pooling of stride 2 after Inception 3b and 4e), and finally 7×7 average pooling, dropout, and a 1,000-way fully connected layer with softmax; local response normalization is also applied in the early layers. Auxiliary classification branches (5×5 average pooling with stride 3, a 1×1 convolution with 128 filters, a 1,024-unit fully connected layer, dropout, and a 1,000-way softmax) are attached after Inception 4a and 4d. All the Inception modules have the same basic architecture as described in Fig. 6.5b; however, the number of filters in each layer is different for each module. As an example, Inception 3a has 64 filters in the top-most branch (see Fig. 6.5b), 96 and 128 filters in the second branch from the top, 16 and 32 filters in the third branch from the top, and 32 filters in the bottom branch. In contrast, Inception 4a has 192 filters in the top-most branch, 96 and 208 filters in the second branch from the top, 16 and 48 filters in the third branch from the top, and 64 filters in the bottom branch. Interested readers are referred to Table 1 in Szegedy et al. [2015] for the exact dimensions of all the layers in each Inception block.

In a residual block, the original input is added to this transformation through a direct connection from the input, which bypasses the transformation layers. This connection is called the "identity skip connection." In this way, the mapping learned in a residual block is split into an identity term (which represents the input) and a residual term, which allows the block to focus on transforming the residual feature maps (Fig. 6.7). In practice, such an architecture achieves stable learning of very deep models. The reason is that the residual mapping is often much simpler to learn than the unreferenced mapping learned in conventional architectures.
Just like the Inception module in GoogleNet, the Residual Network comprises multiple residual blocks stacked on top of each other. The winning model of ILSVRC consisted of 152 weight layers (about 8× deeper than VGGnet-19), which is shown along with a 34-layer model in Table 6.1. Figure 6.8 shows the 34-layer Residual Network as a stack of multiple residual blocks. Note that, in contrast to ResNet, a very deep plain network without any residual connections achieves much higher training and test error rates. This shows that the residual connections are key to the better classification accuracy of such deep networks.
The weight layers in the residual block in Fig. 6.7 are followed by a batch normalization
and a ReLU activation layer. In this design, the identity mapping has to pass through the ReLU
activation after addition with the output of the weight layers. A follow-up work from the authors
of Residual Networks demonstrated that this “post-activation” mechanism can be replaced by a
“pre-activation” mechanism, where the batch normalization and ReLU layers are placed before
the weight layers [He et al., 2016b]. This leads to a direct “unhindered” identity connection
from the input to the output which further enhances the feature learning capability of very deep
networks. This idea is illustrated in Fig. 6.9. With this modified design of the residual blocks,
networks of a depth of 200 layers were shown to perform well without any over-fitting on the
training data (in contrast to the previous design of Residual Networks which started over-fitting
for the same number of layers).
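A minimal PyTorch sketch of a basic residual block in the original post-activation style (two 3 × 3 convolutions with batch normalization, plus the identity skip connection) is shown below; the pre-activation variant of He et al. [2016b] simply moves the batch normalization and ReLU in front of each weight layer.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions plus an identity skip
    connection (the post-activation design of Fig. 6.7)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        # x + F(x): the block only needs to learn the residual mapping.
        return self.relu(x + residual)

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))   # same shape as the input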
Another similar network architecture is the Highway network [Srivastava et al., 2015],
which adds learnable parameters to the shortcut and the main branches. This acts as a switching
mechanism to control the signal flow through both the main and the shortcut branches and
allows the network to decide which branch is more useful for the end-task.

6.7 RESNEXT

The ResNeXt architecture combines the strengths of the GoogleNet and ResNet designs. Specifically, it utilizes the skip connections proposed for residual networks and combines them with the multi-branch architecture of the Inception module. Each block applies a set of transformations to the input feature maps and merges the resulting outputs before feeding the aggregated activations to the next module.
The ResNext block differs from the Inception module in three key aspects. First, it contains a considerably larger number of branches than the Inception module.

Figure 6.7: The residual block, denoted Residual Block (f, n): the input x is passed through two convolution layers (each with f×f filters and n channels) and ReLU nonlinearities to produce the transformation F(x), while an identity connection carries x around these layers so that the block outputs x + F(x).

Figure 6.8: The 34-layer Residual Network architecture: a 7×7 (64 filters) convolution with stride 2 and a 3×3 max-pooling with stride 2, followed by stacked residual blocks with 3×3 filters and 64, 128, 256, and 512 channels, and finally average pooling, a 1,000-way classification layer, and a softmax.

110

6. EXAMPLES OF CNN ARCHITECTURES

Table 6.1: The ResNet architecture for a 34-layer and a 152-layer network. Each group of similar residual blocks is listed on one line, with the number of consecutive blocks denoted by ×N (e.g., ×3 denotes three consecutive residual blocks). The residual block in the 34-layer network consists of two convolution layers, each with 3×3 filters. In contrast, each residual block in the 152-layer network consists of three convolution layers with 1×1, 3×3, and 1×1 filter sizes, respectively. This design of a residual block is called the "bottleneck architecture," because the first convolution layer is used to reduce the number of feature channels coming from the previous layer (e.g., the first convolution layer in the second group of residual blocks in the 152-layer network reduces the number of feature channels to 128). Note that the depth of the architecture can be increased by simply replicating the basic residual block, which contains identity connections (Fig. 6.7). For clarity, the 34-layer network architecture has also been illustrated in Fig. 6.8.

ResNet (34 layers):
- cnv–7×7 (64), stride 2
- maxpool–3×3, stride 2
- [cnv–3×3 (64); cnv–3×3 (64)] ×3
- [cnv–3×3 (128); cnv–3×3 (128)] ×4
- [cnv–3×3 (256); cnv–3×3 (256)] ×6
- [cnv–3×3 (512); cnv–3×3 (512)] ×3
- avgpool
- cnv–1×1 (1000)
- softmax loss

ResNet (152 layers):
- cnv–7×7 (64), stride 2
- maxpool–3×3, stride 2
- [cnv–1×1 (64); cnv–3×3 (64); cnv–1×1 (256)] ×3
- [cnv–1×1 (128); cnv–3×3 (128); cnv–1×1 (512)] ×8
- [cnv–1×1 (256); cnv–3×3 (256); cnv–1×1 (1024)] ×36
- [cnv–1×1 (512); cnv–3×3 (512); cnv–1×1 (2048)] ×3
- avgpool
- cnv–1×1 (1000)
- softmax loss

Figure 6.9: The residual block with the pre-activation mechanism [He et al., 2016b]. (a) Original design: each weight layer is followed by batch normalization (BN) and ReLU, with the final ReLU applied after the addition. (b) Proposed design: BN and ReLU are placed before each weight layer, so the identity path passes through the addition unhindered. (Figure used with permission.)
Second, in contrast to the different filter sizes used in different branches of the Inception architecture, the sequence of transformations in each branch of the ResNext module is identical to that of the other branches.
This simplifies the hyper-parameter choices of the ResNext architecture, compared to the inception module. Finally, the ResNext block contains skip connections which have been found to
be critical in the training of very deep networks. A comparison between a ResNet and ResNext
block is shown in Fig. 6.10. Note that the multi-branch architecture, followed by the aggregation
of responses, leads to an improved performance compared to ResNet. Specifically, a 101-layer
ResNeXt architecture was able to achieve a better accuracy than a ResNet of almost double the depth (200 layers).
The overall ResNext architecture is identical to the ResNet design shown in Table 6.1. The only difference is that the residual blocks are replaced with the ResNext blocks described above. For efficiency purposes, the group of transformations in each ResNext block is implemented as a grouped convolution, where all the N transformation paths are joined in a single layer with a number of output channels equal to N × C, where C is the number of output channels in each individual transformation layer. As an example, the first layer in a ResNext block, shown in Fig. 6.10, is implemented as a single layer with 32 × 4 = 128 output channels. The grouped convolution only allows convolutions within each of the 32 groups, effectively resembling the multi-path architecture shown in the ResNext block.
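The sketch below shows such a ResNext bottleneck block in PyTorch, with a cardinality of 32 and a per-branch width of 4 as in Fig. 6.10b; the 32 branches are realized compactly through the groups argument of the 3 × 3 convolution.

import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """ResNeXt bottleneck block (Fig. 6.10b): 32 branches of width 4,
    implemented compactly as a single grouped convolution."""
    def __init__(self, channels=256, cardinality=32, width=4):
        super().__init__()
        inner = cardinality * width   # 32 x 4 = 128 channels
        self.block = nn.Sequential(
            nn.Conv2d(channels, inner, 1, bias=False),
            nn.BatchNorm2d(inner), nn.ReLU(inplace=True),
            # groups=32 restricts each 3x3 convolution to its own branch.
            nn.Conv2d(inner, inner, 3, padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(inner), nn.ReLU(inplace=True),
            nn.Conv2d(inner, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))   # skip connection as in ResNet

y = ResNeXtBlock()(torch.randn(1, 256, 14, 14))   # -> (1, 256, 14, 14)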

Figure 6.10: A ResNet (a) and a ResNext (b) block with roughly the same number of parameters. (a) The ResNet bottleneck block transforms a 256-dimensional input through 1×1 (64), 3×3 (64), and 1×1 (256) convolutions along a single main branch, alongside a skip connection. (b) The ResNext block consists of 32 branches (formally termed the "block cardinality"), each implementing a 1×1 (4), 3×3 (4), and 1×1 (256) transformation; the branch outputs are aggregated into the 256-dimensional output. Each transformation is listed as its filter dimensions and number of output channels.

6.8 FRACTALNET

The FractalNet design is based on the observation that the residual-based learning is not the only
key enabler for a successful training of very deep convolutional networks. Rather, the existence of
multiple shortcut paths for (both forward and backward) information flow in a network enables a
form of deep supervision, which helps during the network training. The fractal design achieves
a better performance than its residual counterparts on MNIST, CIFAR-10, CIFAR-100, and
SVHN datasets when no data augmentation is used during model training [Larsson et al., 2016].
The fractal design is explained in Fig. 6.11. Instead of a single main branch (as in VGG)
or a two-branch network where one of the branches is an identity connection which learns residual functions, the fractal design consists of multiple branches, each of which has a different number of convolution layers. The number of convolution layers depends on the column number of the branch in the fractal block. If c represents the column number, the number of layers in the c-th branch is $2^{c-1}$. For example, the first column from the left has only one weight layer, the second column has two weight layers, the third has four, and the fourth has eight weight layers. If C denotes the maximum number of columns in a FractalNet block, the total depth of the block will be $2^{C-1}$. Note that the outputs from multiple branches are averaged together to produce a joint output.
Figure 6.11: Left: The basic fractal expansion rule, which contains two branches for information flow; each branch transforms the input using either a single transformation (left branch) or multiple transformations (right branch). Right: A FractalNet [Larsson et al., 2016] with five fractal blocks, each consisting of C columns and H hierarchies of fractals (in the example block shown, C = 4 and H = 3); convolution, join, pooling, and prediction layers are indicated in the layer key. (Figure used with permission.)
In order to avoid redundant feature representations and co-adaptations between the branches, FractalNet utilizes "path-dropout" for the regularization of the network. Path-dropout randomly ignores one of the incoming paths during the joining operation. Although this approach (called "local drop-path") acts as a regularizer, multiple paths may still be available from the input to the output. Therefore, a second version of path-dropout (called "global drop-path") was also introduced, where only a single column is randomly picked in a FractalNet. Both of these regularizations were alternately applied during the network training to avoid model over-fitting.

6.9 DENSENET

The use of skip (and shortcut) connections in the previously discussed architectures, such as
ResNet, FractalNet, ResNext, and the Highway networks, avoids the vanishing gradient problem and enables the training of very deep networks. DenseNet extends this idea by propagating the
output of each layer to all subsequent layers, effectively easing the information propagation in
the forward and backward directions during network training [Huang et al., 2016a]. This allows
all layers in the network to “talk” to each other, and to automatically figure out the best way to
combine multi-stage features in a deep network.

Figure 6.12: A densely connected CNN block with five layers, each consisting of a 1×1 convolution (4k filters) followed by a 3×3 convolution (k filters). Note that every layer receives features from all preceding layers. A transition layer (a 1×1 convolution followed by 2×2 average pooling) is used toward the end of each dense block to reduce the feature set dimensionality.
An example DenseNet is shown in Fig. 6.12. Since the information flow is in the forward
direction, every initial layer is directly connected to all the following layers. These connections
are realized by concatenating the feature maps of a layer with the feature maps from all the
preceding layers. This is in contrast to ResNet, where skip connections are used to add the
output from an initial layer before processing it through the next ResNet block (see Fig. 6.10).
There are three main consequences of such direct connections in a DenseNet. First, the features
from the initial layers are directly passed on to the later layers without any loss of information.
Second, the number of input channels in the subsequent layers of a DenseNet increases rapidly
because of the concatenation of the feature maps from all preceding layers. To keep the network computations feasible, the number of output channels (also termed the "growth rate" in DenseNets) for each convolutional layer is kept relatively small (e.g., 6 or 12). Third, the
concatenation of feature maps can only be performed when their spatial sizes match with each
other. Therefore, a densely connected CNN consists of multiple blocks, each of which has dense
connections within the layers and the pooling or strided convolution operations are performed

in-between these blocks to collapse the input to a compact representation (see Fig. 6.13). The layers which reduce the size of the feature representations between each pair of DenseNet blocks are called "transition layers." The transition layers are implemented as a combination of batch normalization, a 1×1 convolutional layer, and a 2×2 average pooling layer.
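A compact PyTorch sketch of a dense block with bottleneck layers is given below; the growth rate, number of layers, and input width are illustrative choices in the spirit of the description above rather than an exact DenseNet configuration.

import torch
import torch.nn as nn

def dense_layer(in_ch, growth_rate):
    """Pre-activated bottleneck layer: BN-ReLU-Conv(1x1, 4k)-BN-ReLU-Conv(3x3, k)."""
    return nn.Sequential(
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, 4 * growth_rate, 1, bias=False),
        nn.BatchNorm2d(4 * growth_rate), nn.ReLU(inplace=True),
        nn.Conv2d(4 * growth_rate, growth_rate, 3, padding=1, bias=False),
    )

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth_rate=12, num_layers=5):
        super().__init__()
        self.layers = nn.ModuleList(
            [dense_layer(in_ch + i * growth_rate, growth_rate)
             for i in range(num_layers)])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Every layer sees the concatenation of all preceding feature maps.
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)

block = DenseBlock(in_ch=24, growth_rate=12, num_layers=5)
y = block(torch.randn(1, 24, 32, 32))   # -> (1, 24 + 5*12, 32, 32)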

Figure 6.13: Variants of DenseNet architectures [Huang et al., 2016a]. In each variant, the number of dense blocks remains the same (i.e., four); however, the growth rate and the number of convolution layers are increased to design larger architectures. Each transition layer is implemented as a combination of a dimensionality reduction layer (with 1×1 convolutional filters) and an average pooling layer for sub-sampling. (Table used with permission from Huang et al. [2016a].)
The concatenated feature maps from the preceding layers in a DenseNet block are not tunable. Therefore, each layer learns its own representation and concatenates it with the global information which comes from the previous network stages. This information is then provided to the next layer, which can add additional information but cannot directly alter the global information that was learned by the previous layers. This design greatly reduces the number of tunable parameters and explicitly differentiates between the global state of the network and the local contributions that each layer adds to this global state.
DenseNet borrows several design choices from the previous best approaches. For example, it uses a pre-activation mechanism, where each convolution layer is preceded by a batch normalization and a ReLU layer. Similarly, bottleneck layers with 1×1 filters are used to first reduce the number of input feature maps before processing them through the 3×3 filter layers. DenseNet achieves state-of-the-art performance on a number of datasets, such as MNIST, CIFAR-10, and CIFAR-100, with relatively fewer parameters compared to ResNet.

CHAPTER 7

Applications of CNNs in Computer Vision
Computer vision is a very broad research area which covers a wide variety of approaches, not only to process images but also to understand their contents. It is an active research field for convolutional neural network applications. The most popular of these applications include classification, segmentation, detection, and scene understanding. Most CNN architectures have been used for computer vision problems, including supervised or unsupervised face/object classification (e.g., to identify an object or a person in a given image, or to output the class label of that object), detection (e.g., annotating an image with bounding boxes around each object), segmentation (e.g., labeling the pixels of an input image), and image generation (e.g., converting low-resolution images to high-resolution ones). In this chapter, we describe various applications of convolutional neural networks in computer vision. Note that this chapter is not a literature review; rather, it provides a description of representative works in different areas of computer vision.

7.1 IMAGE CLASSIFICATION

CNNs have been shown to be an effective tool for the task of image classification [Deng, 2014]. For instance, in the 2012 ImageNet LSVRC contest, the first large-scale CNN model, called AlexNet [Krizhevsky et al., 2012] (Section 6.2), achieved considerably lower error rates compared to the previous methods. ImageNet LSVRC is a challenging contest, as the training set consists of about 1.2 million images belonging to 1,000 different classes, while the test set has around 150,000 images. Since then, several CNN models (e.g., VGGnet, GoogleNet, ResNet, DenseNet) have been proposed to further decrease the error rate. While Chapter 6 has already introduced these state-of-the-art CNN models for the image classification task, we will discuss below a more recent and advanced architecture, which was proposed for 3D point cloud classification (the input to the model is a raw 3D point cloud) and achieved a high classification performance.

7.1.1 POINTNET
PointNet [Qi et al., 2016] is a type of neural network which takes orderless 3D point clouds as input and respects the permutation invariance of the input points. More precisely, PointNet approximates a set function g defined on an orderless 3D point cloud, $\{x_1, x_2, \ldots, x_n\}$, to map

the point cloud to a vector:

$$g(\{x_1, x_2, \ldots, x_n\}) = \gamma\Big(\max_{i=1,\ldots,n} \{h(x_i)\}\Big), \qquad (7.1)$$

where $\gamma$ and $h$ are multi-layer perceptron (mlp) networks. Thus, the set function g is invariant to the permutation of the input points and can approximate any continuous set function [Qi et al., 2016].
2016].
PointNet, shown in Fig. 7.1, can be used for classification, segmentation, and scene semantic parsing from point clouds. It directly takes an entire point cloud as input and outputs
a class label to accomplish 3D object classification. For segmentation, the network returns per-point labels for all points within the input point cloud. PointNet has three main modules, which
we briefly discuss below.

Figure 7.1: The PointNet architecture. The input to the classification network (Module A) is an n×3 set of 3D points. The network applies a sequence of non-linear transformations, including the input and feature transformations (Submodule A-1 and Submodule A-2) and shared mlp layers of sizes (64, 64) and (64, 128, 1024), and then uses max-pooling to aggregate the point features into a 1024-dimensional global feature. The output is a classification score for each of the C classes. The classification network can be extended to form the segmentation network (Module B), which concatenates the per-point and global features (1088 dimensions) and processes them through mlps of sizes (512, 256, 128) and (128, m) to produce per-point scores. "mlp" stands for multi-layer perceptron and the numbers in brackets are layer sizes.

Classification Network (Module A):
The first key module of PointNet is the classification network (module A), shown in Fig. 7.1.
This module consists of input (Submodule A-1) and feature (Submodule A-2) transformation
networks, multi-layer perceptrons (mlp) networks, and a max-pooling layer. Each point within
the orderless input point cloud ($\{x_1, x_2, \ldots, x_n\}$) is first passed through two shared mlp networks (the function $h(\cdot)$ in Eq. (7.1)) to map the 3D points to a high-dimensional (e.g., 1024-dimensional) feature space. Next, the max-pooling layer (the $\max(\cdot)$ operation in Eq. (7.1)) is used as a symmetric

function to aggregate information from all the points and to make the model invariant to input permutations. The output of the max-pooling layer ($\max_{i=1,\ldots,n} \{h(x_i)\}$ in Eq. (7.1)) is a vector which represents a global shape feature of the input point cloud. This global feature is then passed through an mlp network followed by a soft-max classifier (the function $\gamma(\cdot)$ in Eq. (7.1)) to assign a class label to the input point cloud.

Transformation/Alignment Network (Submodule A-1 and Submodule A-2):
The second module of PointNet is the transformation network (mini-network or T-net in
Fig. 7.1), which consists of a shared mlp network that is applied on each point, followed by a max-pooling layer applied across all points and two fully connected layers. This network predicts an affine transformation to ensure that the semantic labeling of a point cloud is invariant to geometric transformations. As shown in Fig. 7.1, there are two different transformation networks: the input transformation (Submodule A-1) and the feature transformation (Submodule A-2) networks. The input 3D points are first passed through the input transformation network to predict a 3 × 3 affine transformation matrix. Then, new per-point features are computed by applying the affine transformation matrix to the input point cloud (the "Matrix Multiply" box in Submodule A-1).
The feature transformation network is a replica of the input transformation network, shown in Fig. 7.1 (Submodule A-2), and is used to predict the feature transformation matrix. However, unlike the input transformation network, the feature transformation network takes 64-dimensional points. Thus, its predicted transformation matrix has a dimension of 64 × 64, which is higher than that of the input transformation matrix (i.e., 3 × 3 in Submodule A-1). This makes the optimization more difficult. To address this problem, a regularization term is added to the soft-max loss to constrain the 64 × 64 feature transformation matrix to be close to an orthogonal matrix:

$$L_{reg} = \left\| I - AA^{T} \right\|_{F}^{2}, \qquad (7.2)$$

where A is the feature transformation matrix.
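This regularizer is straightforward to implement; the sketch below computes Eq. (7.2) in PyTorch for a batch of predicted 64 × 64 feature transformation matrices (the batch handling and averaging are illustrative assumptions).

import torch

def orthogonality_loss(A):
    """Eq. (7.2): L_reg = || I - A A^T ||_F^2 for a batch of predicted
    feature transformation matrices A of shape (batch, 64, 64)."""
    I = torch.eye(A.shape[-1], device=A.device).expand_as(A)
    diff = I - torch.bmm(A, A.transpose(1, 2))
    return (diff ** 2).sum(dim=(1, 2)).mean()

A = torch.randn(8, 64, 64)
print(orthogonality_loss(A))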
Segmentation Network (Module B):
PointNet can be extended to predict per-point quantities, which rely on both the local geometry and the global semantics. The global feature computed by the classification network (Module A) is fed to the segmentation network (Module B). The segmentation network combines the global and per-point features, and then passes them through an mlp network to extract new per-point features. These new features contain both global and local information, which has been shown to be essential for segmentation [Qi et al., 2016]. Finally, the point features are passed through a shared mlp layer to assign a label to each point.

7.2 OBJECT DETECTION AND LOCALIZATION

Recognizing objects and localizing them in images is one of the challenging problems in computer vision. Recently, several attempts have been made to tackle this problem using CNNs. In
this section, we will discuss three recent CNN-based techniques, which were used for detection
and localization.

7.2.1 REGION-BASED CNN
In Girshick et al. [2016], the Region-based CNN (R-CNN) was proposed for object detection. The broad idea behind R-CNN is region-wise feature extraction using deep CNNs and the learning of independent linear classifiers for each object class. The R-CNN object detection system consists of the following three modules.
Regional Proposals (Module A in Fig. 7.2) and Feature Extraction (Module B)
Given an image, the first module (Module A in Fig. 7.2) uses selective search [Uijlings et al.,
2013] to generate category-independent region proposals, which represent the set of candidate
detections available to the object detector.
Figure 7.2: The R-CNN object detection system. The input to R-CNN is an RGB image. The system extracts region proposals (Module A), warps each region and computes features for each proposal using a deep CNN, e.g., AlexNet (Module B), and then classifies each region using class-specific linear SVMs (Module C), e.g., "aeroplane? no," "person? yes," "TV monitor? no."
The second module (Module B in Fig. 7.2) is a deep CNN (e.g., AlexNet or VGGnet),
which is used to extract a fixed-length feature vector from each region. In both cases (AlexNet or
VGGnet), the feature vectors are 4096-dimensional. In order to extract features from a region
of a given image, the region is first converted to make it compatible with the network input.
More precisely, irrespective of the candidate region's aspect ratio or size, all of its pixels are converted to the required size by warping them within a tight bounding box. Next, features are computed by forward propagating a mean-subtracted RGB image through the network and reading off the output values of the last fully connected layer just before the soft-max classifier.

Training Class Specific SVMs (Module C)
After feature extraction, one linear SVM per class is learned, which is the third module of this
detection system. In order to assign labels to training data, all region proposals with an overlap
greater than or equal to 0.5 IoU with a ground-truth box are considered as positives for the class
of that box, while the rest are considered as negatives. Since the training data is too large to fit in the available memory, the standard hard negative mining method [Felzenszwalb et al., 2010] is used to achieve quick convergence.
Pre-training and Domain Specific Training of Feature Extractor
For the pre-training of the CNN feature extractor, a large ILSVRC2012 classification dataset
with image-level annotations is used. Next, SGD training with the warped region proposals is
performed for adaptation of the network to the new domain and the new task of image detection.
The network architecture is kept unchanged except for the 1,000-way classification layer, which is replaced by a (20+1)-way layer for the PASCAL VOC dataset and a (200+1)-way layer for the ILSVRC2013 dataset (i.e., the number of classes in these datasets plus one for the background).
R-CNN Testing
At test time, selective search is run to select 2,000 region proposals from the test image. To extract features from these regions, each one of them is warped and then forward propagated through the learned CNN feature extractor. The SVM trained for each class is then used to score the extracted feature vectors for that class. Once all the regions have been scored, a greedy non-maximum suppression is used to reject a proposal if its IoU overlap with a higher-scoring selected region is greater than a pre-specified threshold. R-CNN has been shown to improve the mean average precision (mAP) by a significant margin.
R-CNN Drawbacks
The R-CNN discussed above achieves an excellent object detection accuracy. However, it still has a few drawbacks. The training of R-CNN is multi-staged. First, a soft-max loss is employed to fine-tune the CNN feature extractor (e.g., AlexNet, VGGnet) on object proposals. SVMs are then fitted to the network's features; their role is to perform object detection and replace the soft-max classifier. In terms of time and space complexity, the training of this model is computationally expensive, because every region proposal is required to pass through the network. For example, training VGGnet-16 on 5,000 images of the PASCAL VOC07 dataset using a GPU takes 2.5 days. In addition, the extracted features require hundreds of gigabytes of storage. R-CNN is also slow in performing object detection at test time. For example, using VGGnet-16 (on a GPU) as the feature extractor in Module B, detection takes around 47 seconds per image. To overcome these problems, an extension of R-CNN, called Fast R-CNN, was proposed.

7.2.2 FAST R-CNN
Figure 7.3 shows the Fast R-CNN model. The input to Fast R-CNN is an entire image along with object proposals, which are extracted using the selective search algorithm [Uijlings et al., 2013]. In the first stage, convolutional feature maps (usually the feature maps of the last convolutional layer) are extracted by passing the entire image through a CNN, such as AlexNet or VGGnet (Module A in Fig. 7.3). For each object proposal, a feature vector of fixed size is then extracted from the feature maps by a Region of Interest (RoI) pooling layer (Module B), which has been explained in Section 4.2.7. The role of the RoI pooling layer is to convert the features, in a valid RoI, into small feature maps of fixed size (X × Y, e.g., 7 × 7) using max-pooling. X and Y are the layer hyper-parameters. An RoI itself is a rectangular window which is characterized by a 4-tuple that defines its top-left corner (a, b) and its height and width (x, y). The RoI layer divides the RoI rectangular area of size x × y into an X × Y grid of sub-windows, each of approximate size x/X × y/Y. The values in each sub-window are then max-pooled into the corresponding output grid cell. Note that the max-pooling operator is applied independently to each convolutional feature map. Each feature vector is then given as input to fully connected layers, which branch into two sibling output layers. One of these sibling layers (Module C in Fig. 7.3) gives estimates of the soft-max probability over the object classes and a background class. The other layer (Module D in Fig. 7.3) produces four values, which refine the bounding box position, for each of the object classes.
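Assuming the torchvision package is available, its roi_pool operator implements this kind of fixed-size pooling; the sketch below pools a single RoI, given in input-image coordinates, from a dummy feature map into a 7 × 7 grid (the feature map size, the RoI, and the 1/16 spatial scale are illustrative values, not the exact configuration of the Fast R-CNN paper).

import torch
from torchvision.ops import roi_pool

# A dummy convolutional feature map for a batch of one image
# (in practice the output of the last conv layer, e.g., of VGGnet-16).
feature_map = torch.randn(1, 512, 40, 60)

# One RoI per row: (batch_index, x1, y1, x2, y2) in input-image coordinates.
rois = torch.tensor([[0, 64.0, 32.0, 320.0, 288.0]])

# Pool each RoI into a fixed X x Y = 7 x 7 grid; spatial_scale maps
# image coordinates onto the (sub-sampled) feature map.
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)   # -> torch.Size([1, 512, 7, 7])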
Figure 7.3: Fast R-CNN Architecture. The input to a fully convolutional network (e.g., AlexNet)
is an image and RoIs (Module A). Each RoI is pooled into a fixed-size feature map (Module B)
and then passed through the two fully-connected layers to form a RoI feature vector. Finally, the
RoI feature vector is passed through two sibling layers, which output the soft-max probability
over different classes (Module C) and bounding box positions (Module D).

Fast R-CNN Initialization and Training
Fast R-CNN is initialized from three pre-trained ImageNet models including:
AlexNet [Krizhevsky et al., 2012], VGG-CNN-M-1024 [Chatfield et al., 2014], and
VGGnet-16 [Simonyan and Zisserman, 2014b] models. During initialization, a Fast R-CNN
model undergoes three transformations. First, a RoI pooling layer (Module B) replaces the last
max-pooling layer to make it compatible with the network’s first fully connected layer (e.g., for
VGGnet-16, X = Y = 7). Second, the two sibling layers (Module C and Module D), discussed
above, replace the last fully connected layer and soft-max of the network. Third, the network
model is tweaked to accept two data inputs, i.e., a list of images and their RoIs. Then SGD
simultaneously optimizes the soft-max classifier (Module C) and the bounding-box regressor
(Module D) in an end-to-end fashion by using a multi-task loss on each labeled RoI.

Detection Using Fast R-CNN
For detection, the Fast R-CNN takes as an input, an image along with a list of R object proposals to score. During testing, R is kept at around 2,000. For each test RoI (r ), the class probability
score for each of the K classes (module C in Fig. 7.3) and a set of refined bounding boxes are
computed (module D in Fig. 7.3). Note that each of the K classes has its own refined bounding
box. Next, the algorithm and the configurations from R-CNN are used to independently perform non-maximum suppression for each class. Fast R-CNN has been shown to achieve a higher
detection performance (with a mAP of 66% compared to 62% for R-CNN) and computational efficiency on the PASCAL VOC 2012 dataset. The training of Fast R-CNN, using VGGnet-16, was 9× faster than that of R-CNN, and the network was found to be 213× faster at test time.
While Fast R-CNN has increased the training and testing speed of R-CNN by sharing a
single CNN computation across all region proposals, its computational efficiency is bounded by
the speed of the region proposal methods, which are run on CPUs. The straightforward approach
to address this issue is to implement region proposal algorithms on GPUs. Another elegant way
is to rely on algorithmic changes. In the following section, we discuss an architecture, known
as Regional Proposal Network [Ren et al., 2015], which relies on an algorithmic change by
computing nearly cost-free region proposals with CNN in an end-to-end learning fashion.

7.2.3 REGIONAL PROPOSAL NETWORK (RPN)
The Regional Proposal Network [Ren et al., 2015] (Fig. 7.4) simultaneously predicts object
bounding boxes and objectness scores at each position. RPN is a fully convolutional network,
which is trained in an end-to-end fashion, to produce high quality region proposals for object
detection using Fast R-CNN (described in Section 7.2.2). Combining RPN with Fast R-CNN
object detector results in a new model, which is called Faster R-CNN (shown in Fig. 7.5).
Notably, in Faster R-CNN, the RPN shares its computation with Fast R-CNN by sharing the
same convolutional layers, allowing for joint training. The former has five, while the latter has 13 shareable convolutional layers.


Figure 7.4: Regional proposal network (RPN) architecture, which is illustrated at a single position.
Figure 7.5: Faster R-CNN, which combines the RPN with the Fast R-CNN object detector.

In the following, we first discuss the RPN and then Faster R-CNN, which merges RPN and Fast R-CNN into a single network.
The input to the RPN is an image (re-scaled in such a way that its smaller side is equal to
600 pixels) and the output is a set of object bounding boxes and associated objectness scores (as
shown in Fig. 7.4). To generate region proposals with RPN, the image is first passed through
the shareable convolutional layers. A small spatial window (e.g., 3 × 3) is then slid over the output feature maps of the last shared convolutional layer, e.g., conv5 for VGGnet-16 (Module A
in Fig. 7.4). This sliding window at each position is then transformed to a lower-dimensional
vector of size 256-d and 512-d for ZF and VGGnet-16 models (Module B in Fig. 7.4), respectively. Next, this vector is given as an input to two sibling fully connected layers, including the
bounding box regression layer (reg) and the bounding box classification layer (cls) (Module C
in Fig. 7.4). In summary, the above-mentioned steps (Module A, Module B, and Module C)
can be implemented with a 3 × 3 convolutional layer followed by two sibling 1 × 1 convolutional
layers.
Multi-scale Region Proposal Detection
Unlike image/feature pyramid-based methods, such as the spatial pyramid pooling (SPP)
layer [He et al., 2015b] (discussed in Section 4.2.8), which uses time-consuming feature pyramids in convolutional feature maps, RPN uses a nearly cost-free algorithm for addressing multiple scales and aspect ratios. For this purpose, r region proposals are simultaneously predicted
at the location of each sliding window. The reg layer (Module C in Fig. 7.4) therefore encodes
the coordinates of r boxes by producing 4r outputs. To predict the probability of “an object” or
“no object” in each proposal, the cls layer outputs 2r scores (the sum of outputs is 1 for each
proposal). The r proposals are parameterized relative to anchors (i.e., r reference boxes which
are centered at the sliding window as shown in Module D). These anchors are associated with
three different aspect ratios and three different scales, resulting in a total of nine anchors at each
sliding window.
RPN Anchors Implementation
For the anchors, 3 scales with box areas of 128², 256², and 512² pixels, and 3 aspect ratios of 2:1, 1:1, and 1:2 are used. When predicting large proposals, the proposed algorithm allows for
the use of anchors that are larger than the receptive field. This design helps in achieving high
computational efficiency, since multi-scale features are not needed.
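A small sketch of how the nine anchor shapes per sliding-window position can be generated is given below; the exact parameterization used in the original implementation may differ, so this is only illustrative.

import numpy as np

def generate_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the nine (w, h) anchor shapes used at every sliding-window
    position: three areas (scale**2) times three aspect ratios r = h/w."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)       # width shrinks as the ratio h/w grows
            h = s * np.sqrt(r)       # w * h == s**2, so the area is preserved
            anchors.append((w, h))   # centered at the sliding-window position
    return np.array(anchors)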
During the training phase, all cross-boundary anchors are ignored so that they do not contribute to the loss (otherwise, the training does not converge). There are roughly 20k anchors, i.e., about 60 × 40 × 9, for a 1,000 × 600 image.¹ By ignoring the cross-boundary anchors, only around 6k anchors per image are left for training. During testing, the entire image is given as an input to the RPN. As a result, cross-boundary boxes are produced, which are then clipped to the image boundary.

¹Since the total stride for both the ZF and VGG nets on the last convolutional layer is 16 pixels.
RPN Training
During the training of RPN, each anchor takes a binary label to indicate the presence of “an
object” or “no object” in the input image. Two types of anchors take a positive label: an anchor that has the highest IoU overlap with a ground-truth bounding box, or an anchor whose IoU overlap with a ground-truth bounding box is higher than 0.7. An anchor that has an IoU ratio lower than 0.3 is assigned a negative label.
In addition, the contribution of anchors which do not have a positive or negative label is not
considered toward the training objective. The RPN multi-task loss function for an image is given
by:
\[
L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*), \tag{7.3}
\]
where i denotes the index of an anchor and p_i represents the predicted probability that anchor i is an object. p_i* stands for the ground-truth label, which is 1 for a positive anchor and 0 otherwise. The vectors t_i and t_i* contain the four coordinates of the predicted bounding box and of the ground-truth box, respectively. L_cls is the soft-max loss over the two classes. The term p_i* L_reg (where L_reg is the regression loss) indicates that the regression loss is activated only for positive anchors and is disabled otherwise. N_cls, N_reg, and a balancing weight λ are used to normalize the cls and reg terms.
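A minimal PyTorch-style sketch of this multi-task loss is shown below; PyTorch, the smooth-L1 regression loss, the value of λ, and the normalization constants are assumptions made for illustration rather than the exact published implementation.

import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, bbox_pred, labels, bbox_targets, lam=10.0):
    """Sketch of the RPN multi-task loss of Eq. (7.3).
    cls_logits:   (N, 2)  object/no-object scores per sampled anchor
    bbox_pred:    (N, 4)  predicted box offsets t_i
    labels:       (N,)    long tensor, 1 = positive anchor, 0 = negative anchor
    bbox_targets: (N, 4)  ground-truth offsets t_i* (only used for positives)"""
    # Classification term, normalized by the number of sampled anchors (N_cls).
    cls_loss = F.cross_entropy(cls_logits, labels)
    # Regression term, active only for positive anchors (p_i* = 1).
    pos = labels == 1
    reg_loss = F.smooth_l1_loss(bbox_pred[pos], bbox_targets[pos], reduction="sum")
    reg_loss = reg_loss / max(labels.numel(), 1)   # rough N_reg normalization
    return cls_loss + lam * reg_loss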
Faster R-CNN Training: Sharing Convolutional Features for Region Proposal and Object
Detection
In the previous sections, we discussed the training of the RPN network for the generation of
region proposals. We did not, however, consider Fast R-CNN for region-based object detection using these proposals. In the following, we explain the training of the Faster R-CNN network, which is composed of RPN and Fast R-CNN with shared convolutional layers, as shown in
Fig. 7.5.
Instead of learning the two networks (i.e., RPN and Fast R-CNN) separately, a four-step alternating optimization approach is used to share convolutional layers between these two
models. Step 1: The RPN is trained using the previously stated strategy. Step 2: The proposals
produced by step 1 are used to train a separate Fast R-CNN detection network. The two networks
do not share convolutional layers at this stage. Step 3: The Fast R-CNN network initializes
the training of RPN and the layers specific to RPN are fine-tuned, while keeping the shared
convolutional layers fixed. At this point, the two networks share convolutional layers. Step 4:
The fc layers of the Fast R-CNN are fine-tuned by keeping the shared convolutional layers fixed.
As such, a unified network, which is called Faster R-CNN, is formed by using both of these
models, which share the same convolutional layers. The Faster R-CNN model has been shown
to achieve competitive detection results, with an mAP of 59.9%, on the PASCAL VOC 2007 dataset.

7.3 SEMANTIC SEGMENTATION

CNNs can also be adapted to perform dense predictions for per-pixel tasks such as semantic
segmentation. In this section, we discuss three representative semantic segmentation algorithms
which use CNN architectures.

7.3.1 FULLY CONVOLUTIONAL NETWORK (FCN)
In this section, we will briefly describe the Fully Convolutional Network (FCN) [Long et al.,
2015], shown in Fig. 7.6, for semantic segmentation.

Figure 7.6: FCN architecture. The first seven layers are adopted from AlexNet. Convolutional
layers, shown in orange, have been changed from fully connected (fc6 and fc7 in AlexNet) to
convolutional layers (conv6 and conv7 in FCN). The next convolutional layer (shown in green)
is added on top of conv7 to produce 21 coarse output feature maps (21 denotes the number of
classes + background in the PASCAL VOC dataset). The last layer (yellow) is the transposed
convolutional layer, that up-samples the coarse output feature maps produced by conv8 layer.
Typical classification networks (discussed in Chapter 6) take fixed-size images as input
and produce non-spatial output maps, that are fed to a soft-max layer to perform classification.
The spatial information is lost, because these networks use fixed dimension fully connected layers
in their architecture. However, these fully connected layers can also be viewed as convolution
layers with large kernels, which cover their entire input space. For example, a fully-connected layer with 4,096 units that takes a volume of size 13 × 13 × 256 as input can be equivalently expressed as a convolutional layer with 4,096 kernels of size 13 × 13. Hence, the dimension of the output will be 1 × 1 × 4,096, which results in an output identical to that of the initial fully-connected layer. Based on this reinterpretation, the convolutionalized networks can take (1) input images of any size and produce (2) spatial output maps. These two aspects make the fully convolutional models a natural choice for semantic segmentation.
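The equivalence between a fully-connected layer and a convolution with large kernels can be verified with a short PyTorch sketch; PyTorch is used here only for illustration and is not prescribed by the text.

import torch
import torch.nn as nn

# A Linear layer over a flattened 13 x 13 x 256 volume is equivalent to a
# Conv2d layer with 4,096 kernels of size 13 x 13.
fc = nn.Linear(256 * 13 * 13, 4096)
conv = nn.Conv2d(256, 4096, kernel_size=13)

# Copy the fc weights into the conv kernels (same parameters, reshaped).
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(4096, 256, 13, 13))
    conv.bias.copy_(fc.bias)

x = torch.randn(1, 256, 13, 13)
out_fc = fc(x.flatten(1))      # shape (1, 4096)
out_conv = conv(x)             # shape (1, 4096, 1, 1)
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-5))   # True

# On a larger input, the convolutional version simply produces a spatial map
# of outputs instead of a single vector.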
FCN for semantic segmentation [Long et al., 2015] is built by first converting typical classification networks (e.g., AlexNet, VGGnet-16) into fully convolutional networks and then appending a transposed convolution layer (discussed in Section 4.2.6) to the end of the convolutionalized networks. The transposed convolution layer is used for up-sampling the coarse output feature maps produced by the last layer of the convolutionalized networks. More precisely, given a classification network, e.g., AlexNet, the last three fully-connected layers (4,096-D fc6, 4,096-D fc7, and 1,000-D fc8) are converted to three convolutional layers (conv6: consisting of 4,096 filters of size 13 × 13, conv7: consisting of 4,096 filters of size 1 × 1, and conv8: consisting of 21 filters² of size 1 × 1) for semantic segmentation on the PASCAL VOC dataset. However, forward passing an H × W × 3 input image through this modified network produces output feature maps with a spatial size of H/32 × W/32, which is 32× smaller than the spatial size of the original input image. Thus, these coarse output feature maps are fed to the transposed convolution layer³ to produce a dense prediction of the input image.
However, a large pixel stride (e.g., 32 for AlexNet) at the transposed convolution layer limits the level of detail in the up-sampled output. To address this issue, FCN is extended to new fully convolutional networks, i.e., FCN-16s and FCN-8s, by combining coarse features with fine ones that are extracted from the shallower layers. The architectures of these two FCNs are shown in the second and third rows of Fig. 7.7, respectively. The FCN-32s in this figure is identical to the FCN discussed above (and shown in Fig. 7.6), except that their underlying architectures are different.⁴
FCN-16s
As shown in the second row of Fig. 7.7, FCN-16s combines predictions from the final convolutional layer and the pool4 layer, at stride 16. Thus, the network predicts finer details compared to FCN-32s, while retaining high-level semantic information. More precisely, the class scores computed on top of the conv7 layer are passed through a transposed convolutional layer (i.e., a 2× up-sampling layer) to produce class scores for each of the PASCAL VOC classes (including background). Next, a 1 × 1 convolutional layer with a channel dimension of 21 is added on top of the pool4 layer to predict new scores for each of the PASCAL classes. Finally, these two stride-16 predicted scores are summed up and fed to another transposed convolutional layer (i.e., a 16× up-sampling layer) to produce prediction maps with the same size as the input image. FCN-16s
improved the performance of FCN-32s on the PASCAL VOC 2011 dataset by 3.0 mean IoU and achieved 62.4 mean IoU.

²Equal to the number of classes in the PASCAL VOC dataset plus the background class.
³The 32-pixel stride is used to output a dense prediction at the same scale as the input image.
⁴The original FCN shown in Fig. 7.6 is initialized with the AlexNet model, and the extended FCN shown in Fig. 7.7 is initialized with the VGGnet-16 model.

Figure 7.7: FCN learns to combine shallow layer (fine) and deep layer (coarse) information. The first row shows FCN-32s, which up-samples predictions back to pixels in a single step. The second row illustrates FCN-16s, which combines predictions from both the final and the pool4 layer. The third row shows FCN-8s, which provides further precision by considering additional predictions from the pool3 layer.
FCN-8s
In order to obtain more accurate prediction maps, FCN-8s combines predictions from the final convolutional layer and shallower layers, i.e., the pool3 and pool4 layers. The third row of Fig. 7.7 shows the architecture of the FCN-8s network. First, a 1 × 1 convolutional layer with a channel dimension of 21 is added on top of the pool3 layer to predict new scores at stride 8 for each of the PASCAL VOC classes. Then, the summed predicted scores at stride 16 (second row in Fig. 7.7) are passed through a transposed convolutional layer (i.e., a 2× up-sampling layer) to produce new prediction scores at stride 8. These two predicted scores at stride 8 are summed up and then fed to another transposed convolutional layer (i.e., an 8× up-sampling layer) to produce prediction maps with the same size as the input image. FCN-8s improved the performance of FCN-16s by a small margin in mean IoU and achieved 62.7 mean IoU on PASCAL VOC 2011.
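A minimal PyTorch sketch of this skip-connection fusion is given below; the transposed-convolution kernel sizes and the channel counts (512 for pool4 and 256 for pool3 in VGGnet-16) are illustrative assumptions rather than the exact FCN implementation.

import torch
import torch.nn as nn

class FCN8sHead(nn.Module):
    """Sketch of the FCN-8s fusion: score maps from conv7, pool4, and pool3
    are summed at stride 8 and then up-sampled 8x to the input resolution."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.score_pool3 = nn.Conv2d(256, num_classes, kernel_size=1)
        self.score_pool4 = nn.Conv2d(512, num_classes, kernel_size=1)
        self.up2_conv7 = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up2_fused = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up8 = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8, padding=4)

    def forward(self, conv7_scores, pool4_feat, pool3_feat):
        s16 = self.up2_conv7(conv7_scores) + self.score_pool4(pool4_feat)  # stride 16
        s8 = self.up2_fused(s16) + self.score_pool3(pool3_feat)            # stride 8
        return self.up8(s8)                                                # input size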
FCN Fine-tuning
Since training from scratch is not feasible due to the large number of training parameters and
the small number of training samples, fine-tuning using back-propagation through the entire
network is done to train FCN for segmentation. It is important to note that FCN uses whole
image training instead of patch-wise training, where the network is learned from batches of
random patches (i.e., small image regions surrounding the objects of interest). Fine-tuning of
the coarse FCN-32s version takes three GPU days, and about one GPU day each for the
FCN-16s and FCN-8s [Long et al., 2015]. FCN has been tested on PASCAL VOC, NYUDv2,
and SIFT Flow datasets for the task of segmentation. It has been shown to achieve superior
performance compared to other reported methods.
FCN Drawbacks
The FCN discussed above has a few limitations. The first issue is related to the single transposed convolutional layer of FCN, which cannot accurately capture the detailed structures of objects. While FCN-16s and FCN-8s attempt to alleviate this issue by combining coarse (deep layer) information with fine (shallower layer) information, the detailed structures of objects are still lost or smoothed in many cases. The second issue is related to scale, i.e., the fixed-size receptive field of FCN. This causes objects that are larger or smaller than the receptive field to be mislabeled.
To overcome these challenges, Deep Deconvolution Network (DDN) has been proposed, which
is discussed in the following section.

7.3.2 DEEP DECONVOLUTION NETWORK (DDN)
DDN [Noh et al., 2015] consists of a convolution network (Module A shown in Fig. 7.8) and a deconvolution network (Module B). The convolution network acts as a feature extractor and converts the input image to a multi-dimensional feature representation. The deconvolution network, on the other hand, is a shape generator, which uses these extracted feature maps and produces class score prediction maps with the same spatial size as the input image. These class score prediction maps represent the probability that each pixel belongs to each of the different classes.
Figure 7.8: Overall architecture of deep deconvolution network. A multilayer deconvolution
network is put on top of the convolution network to accurately perform image segmentation.
Given a feature representation from the convolution network (Module A), dense class score
prediction maps are constructed through multiple un-pooling and transposed convolution layers
(Module B).

DDN uses a convolutionalized VGGnet-16 network for the convolution part. More precisely, the last fully connected layer of VGGnet-16 is removed, and the remaining two fully connected layers are converted to convolutional layers (similar to FCN). The deconvolution part is the mirror of the convolution network. Unlike FCN-32s, which uses a single transposed convolutional layer, the deconvolution network of DDN uses a sequence of un-pooling and transposed convolution layers to generate class prediction maps with the same spatial size as the input image. In contrast to the convolution part of the model, which decreases the spatial size of the output feature maps through feed-forwarding, its counterpart increases the size by combining transposed convolution and un-pooling layers.
Un-pooling Layer
The un-pooling layers of the deconvolution network of DDN perform the reverse operation of
the max-pooling layers of the convolution network. To be able to perform reverse max-pooling,
the max-pooling layers save the locations of the maximum activations in their “switch variables,”
i.e., essentially the argmax of the max-pooling operation. The un-pooling layers then employ
these switch variables to place the activations back to their original pooled locations.
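This pooling/un-pooling pairing via switch variables can be illustrated with a short PyTorch sketch; PyTorch is used only for illustration.

import torch
import torch.nn as nn

# Max-pooling that records the argmax locations ("switch variables") ...
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
# ... and the corresponding un-pooling layer that places activations back.
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 1, 4, 4)
pooled, switches = pool(x)            # switches hold the pooled positions
restored = unpool(pooled, switches)   # sparse map: max values at original spots,
print(restored.shape)                 # zeros elsewhere; shape (1, 1, 4, 4)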
DDN Training
To train this very deep network with a relatively small number of training examples, the following strategies are adopted. First, every convolutional and transposed convolutional layer is followed by
a batch normalization layer (discussed in Section 5.2.4), which has been found to be critical
for DDN optimization. Second, unlike FCN which performs image-level segmentation, DDN
uses instance-wise segmentation, in order to handle objects at various scales and decrease the
training complexity. For this purpose, a two-stage training approach is used.
In the first phase, DDN is trained with easy samples. To generate the training samples
for this phase, ground-truth bounding box annotations of objects are used to crop each object so
that the object is centered at the cropped bounding box. In the subsequent phase, the learned
model from the first phase is fine-tuned with more challenging samples. Thus, each object proposal contributes to the training samples. Specifically, the candidate object proposals, which
sufficiently overlap with the ground-truth segmented regions (≥ 0.5 in IoU) are selected for
training. To include context, post-processing is also adopted in this phase.
DDN Inference
Since DDN uses instance-wise segmentation, an algorithm is required to aggregate the output
score prediction maps of individual object proposals within an image. DDN uses the pixel-wise
maximum of the output prediction maps for this purpose. More precisely, the output prediction
maps of each object proposal (g_i ∈ R^{W×H×C}, where C is the number of classes and i, W, and H denote the index, height, and width of the object proposal, respectively) are first superimposed on the image space with zero padding outside g_i. Then, the pixel-wise prediction map of the
entire image is computed as follows:
\[
P(x, y, c) = \max_i G_i(x, y, c), \quad \forall i, \tag{7.4}
\]

where G_i is the prediction map corresponding to g_i in the image space, with zero padding outside g_i. Next, a soft-max is applied to the aggregated prediction maps P in Eq. (7.4) to obtain the class probability maps O over the entire image space. The final pixel-wise labeled image is computed by applying the fully connected CRF [Krähenbühl and Koltun, 2011] to the output class probability maps O.
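A small NumPy sketch of the pixel-wise max aggregation in Eq. (7.4) is given below; the proposal-placement format is an illustrative assumption.

import numpy as np

def aggregate_proposals(proposal_maps, positions, image_h, image_w, num_classes):
    """Pixel-wise max aggregation of instance-wise prediction maps (Eq. 7.4).
    proposal_maps: list of arrays of shape (h_i, w_i, C), one per proposal
    positions:     list of (top, left) positions of each proposal in the image"""
    P = np.zeros((image_h, image_w, num_classes))
    for g, (top, left) in zip(proposal_maps, positions):
        G = np.zeros_like(P)                       # zero padding outside g_i
        h, w, _ = g.shape
        G[top:top + h, left:left + w, :] = g
        P = np.maximum(P, G)                       # element-wise max over proposals
    return P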
Conditional Random Field (CRF): A CRF is a class of statistical modeling techniques that can be viewed as the sequential counterpart of logistic regression. In contrast to logistic regression, which is a log-linear classification model, a CRF is a log-linear model for sequential labeling. A CRF models the conditional probability of Y given X, written P(Y|X), where X denotes a multi-dimensional input, i.e., features, and Y denotes a multi-dimensional output, i.e., labels. The probabilities are modeled with two kinds of terms, i.e., unary potentials and pairwise potentials. A unary potential models the probability that a given pixel or patch belongs to each particular category, while a pairwise potential models the relation between two different pixels or patches of an image. In a fully connected CRF, e.g., Krähenbühl and Koltun [2011], the latter is used to define pairwise potentials on all pixel pairs in a given image, thus resulting in refined segmentation and labeling. For more details on CRFs, the reader is referred to Krähenbühl and Koltun [2011].

DDN Testing
For each test image, approximately 2,000 object proposals are generated using the edge-box algorithm [Zitnick and Dollár, 2014]. Next, the best 50 proposals with the highest objectness scores are selected. These object proposals are then used to compute instance-wise segmentations, which are then aggregated, using the algorithm discussed above, to obtain the semantic segmentation for the whole image. DDN has been shown to achieve an outstanding performance of 72.5% on the PASCAL VOC 2012 dataset compared to other reported methods.
DDN Drawbacks
The DDN discussed above has a few limitations. First, DDN uses multiple transposed convolutional layers in its architecture, which require additional memory and time. Second, training DDN is tricky and requires a large corpus of training data to learn the transposed convolutional layers. Third, DDN deals with objects at multiple scales by performing instance-wise segmentation; therefore, it requires feed-forwarding all object proposals through the DDN, which is a time-consuming process. To overcome these challenges, the DeepLab model has been proposed,
which is discussed in the following section.

7.3.3 DEEPLAB
In DeepLab [Chen et al., 2014], the task of semantic segmentation has been addressed by employing convolutional layers with up-sampled filters, which are called atrous convolution (or
dilated convolution, as discussed in Chapter 4).
Recall that forward passing an input image through the typical convolution classification
networks reduces the spatial scale of the output feature maps, typically by a factor of 32. However,
for dense prediction tasks, e.g., semantic segmentation, a stride of 32 pixels limits the level of detail in the up-sampled output maps. One partial solution is to append multiple transposed convolutional layers, as in FCN-8s and DDN, to the top of the convolutionalized classification networks to produce output maps with the same size as the input image. However, this approach is too costly.⁵ Another critical limitation of these convolutionalized classification networks is that they have a predefined, fixed-size receptive field. For example, FCN and all its variants (i.e., FCN-16s and FCN-8s) use VGGnet-16 with fixed-size 3 × 3 filters. Therefore, an object with a substantially smaller or larger spatial size than the receptive field is problematic.⁶
DeepLab uses atrous convolution in its architecture to simultaneously address these two
issues. As discussed in Chapter 4, atrous convolution allows one to explicitly control the spatial size of the output feature maps that are computed within convolution networks. It also extends the receptive field without increasing the number of parameters. Thus, it can effectively incorporate a wider image context while performing convolutions. For example, to increase the spatial size of the output feature maps in the convolutionalized VGGnet-16 network⁷ by a factor of 2, one could set the stride of the last max-pooling layer (pool5) to 1 and then substitute the subsequent convolutional layer (conv6, which is the convolutionalized version of fc6) with an atrous convolutional layer with a sampling factor d = 2. This modification also extends the effective filter size from 3 × 3 to 5 × 5, and therefore enlarges the receptive field of the filters.
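The effect of the atrous sampling factor can be illustrated with a short PyTorch sketch; the channel counts are illustrative.

import torch
import torch.nn as nn

x = torch.randn(1, 512, 64, 64)   # e.g., pool5 output after setting its stride to 1

# Standard 3x3 convolution: receptive field 3x3.
conv = nn.Conv2d(512, 1024, kernel_size=3, padding=1)

# Atrous (dilated) 3x3 convolution with sampling factor d = 2: the same nine
# weights are spread over a 5x5 footprint, enlarging the receptive field
# without adding parameters or shrinking the output.
atrous = nn.Conv2d(512, 1024, kernel_size=3, padding=2, dilation=2)

print(conv(x).shape, atrous(x).shape)   # both (1, 1024, 64, 64)
print(sum(p.numel() for p in conv.parameters()) ==
      sum(p.numel() for p in atrous.parameters()))   # True: same parameter count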
DeepLab-LargeFOV Architecture
DeepLab-LargeFOV is a CNN with an atrous convolutional layer which has a large receptive
field. Specifically, DeepLab-LargeFOV is built by converting the first two fully-connected layers of VGGnet-16 (i.e., fc6 and fc7) to convolutional layers (i.e., conv6 and conv7) and then appending a 1 × 1 convolution layer (i.e., conv8) with 21 channels at the end of the convolutionalized network for semantic segmentation on the PASCAL VOC dataset. The stride of the last two max-pooling layers (i.e., pool4 and pool5) is changed to 1,⁸ and the convolutional filter of the
conv6 layer is replaced with an atrous convolution of kernel size 3 × 3 and an atrous sampling factor d = 12. Therefore, the spatial size of the output class score prediction maps is increased by a factor of 4. Finally, a fast bilinear interpolation layer⁹ (i.e., 8× up-sampling) is employed to recover the output prediction maps at the original image size.

⁵Training these networks requires relatively more training data, time, and memory.
⁶Label predictions for large objects are obtained from only local information. Thus, pixels which belong to the same large object may have inconsistent labels. Moreover, pixels which belong to small objects may be classified as background.
⁷A convolutionalized VGGnet-16 is obtained by converting all the fully-connected layers to convolution layers.
⁸The stride of the pool4 and pool5 layers in the original VGGnet-16 is 2.
Atrous Spatial Pyramid Pooling (ASPP)
In order to capture objects and context at multiple scales, DeepLab employs multiple parallel
atrous convolutional layers (discussed in Section 4.2.2) with different sampling factors d , which
is inspired by the success of the Spatial Pyramid Pooling (SPP) layer discussed in Section 4.2.8.
Specifically, DeepLab with an atrous Spatial Pyramid Pooling layer (called DeepLab-ASPP) is built by employing four parallel conv6-conv7-conv8 branches with 3 × 3 filters and different atrous factors (d ∈ {6, 12, 18, 24}) in the conv6 layers, as shown in Fig. 7.9. The output prediction maps from all four parallel branches are aggregated to generate the final class score maps. Next, a fast bilinear interpolation layer is employed to recover the output prediction maps with the original image size.
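A minimal PyTorch sketch of the four parallel ASPP branches is given below; the channel sizes and the use of summation as the aggregation step are illustrative assumptions based on the description above.

import torch
import torch.nn as nn

class ASPPBranch(nn.Module):
    """One conv6-conv7-conv8 branch of DeepLab-ASPP (channel sizes are
    illustrative); conv6 is an atrous 3x3 convolution with rate d."""
    def __init__(self, in_ch, num_classes, d):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(in_ch, 1024, kernel_size=3, padding=d, dilation=d),  # conv6
            nn.ReLU(inplace=True),
            nn.Conv2d(1024, 1024, kernel_size=1),                          # conv7
            nn.ReLU(inplace=True),
            nn.Conv2d(1024, num_classes, kernel_size=1),                   # conv8
        )

    def forward(self, x):
        return self.branch(x)

class ASPP(nn.Module):
    """Four parallel branches with atrous rates {6, 12, 18, 24}; their class
    score maps are summed to give the final prediction maps."""
    def __init__(self, in_ch=512, num_classes=21, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([ASPPBranch(in_ch, num_classes, d) for d in rates])

    def forward(self, x):            # x: pool5 feature maps
        return sum(b(x) for b in self.branches)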
Figure 7.9: Architecture of DeepLab-ASPP (DeepLab with Atrous Spatial Pyramid Pooling
layer). pool5 denotes the output of the last pooling layer of VGGnet-16.

⁹Instead of using transposed convolutional layers as in FCN and DDN, DeepLab uses a fast bilinear interpolation layer without training its parameters. This is because, unlike FCN and DDN, the output class score prediction maps of DeepLab (i.e., the output maps of the conv8 layer) are quite smooth, and thus a single up-sampling step can efficiently recover the output prediction maps with the same size as the input image.

DeepLab Inference and Training
The output prediction maps of the bilinear interpolation layer can only predict the presence
and rough position of objects, but the boundaries of objects cannot be recovered. DeepLab
handles this issue by combining the fully connected CRF [Krähenbühl and Koltun, 2011] with
the output class prediction maps of the bilinear interpolation layer. The same approach is also
used in DDN. During training, the Deep Convolutional Neural Network (DCNN) and the
CRF training stages are decoupled. Specifically, a cross validation of the fully connected CRF
is performed after the fine-tuning of the convolutionalized VGGnet-16 network.
DeepLab Testing
DeepLab has been tested on the PASCAL VOC 2012 validation set and has been shown to
achieve state-of-the-art performance of 71.6% mean IoU compared to other reported methods including fully convolutional networks. Moreover, DeepLab with atrous Spatial Pyramid
Pooling layer (DeepLab-ASPP) achieved about 2% higher accuracy than DeepLab-LargeFOV.
Experimental results show that DeepLab based on ResNet-101 delivers better segmentation
results compared to DeepLab employing VGGnet-16.

7.4 SCENE UNDERSTANDING

In computer vision, the recognition of single or isolated objects in a scene has achieved significant
success. However, developing a higher level of visual scene understanding requires more complex
reasoning about individual objects, their 3D layout and mutual relationships [Khan, 2016, Li
et al., 2009]. In this section, we will discuss how CNNs have been used in the area of scene
understanding.

7.4.1 DEEPCONTEXT
DeepContext [Zhang et al., 2016] presents an approach to embed 3D context into the topology
of a neural network, trained to perform holistic scene understanding. Given an input RGBD image, the network can simultaneously make global predictions (e.g., scene category and
3D scene layout) as well as local decisions (e.g., the position and category of each constituent
object in the 3D space). This method works by first learning a set of scene templates from
the training data, which encodes the possible locations of single or multiple instances of objects
belonging to a specific category. Four different scene templates are used: sleeping area, lounging area, office area, and table-and-chair set. Given this contextual scene representation,
DeepContext matches an input volumetric representation of an RGBD image with one of the
scene templates using a CNN (Module B, Fig. 7.10). Afterward, the input scene is aligned
with the scene template using a transformation network (Module C, Fig. 7.10). The aligned
volumetric input is fed to a deep CNN with two main branches; one works on the complete
3D input and obtains a global feature representation while the second works on the local object
level and predicts the location and existence of each potential object in the aligned template
(Module D, Fig. 7.10).


Figure 7.10: The block diagram for DeepContext processing pipeline. Given a 3D volumetric input (module A), the transformation network (module C) aligns the input data with its corresponding scene template
(estimated by module B). Using this roughly aligned scene, the 3D context network (module D) estimates the
existence of an object and adjusts the object location based on local object features and holistic scene features,
to understand a 3D scene.

The DeepContext algorithm follows a hierarchical process for scene understanding which
is discussed below.
Learning Scene Templates
The layouts of the scene templates (e.g., office, sleeping area) are learned from the SUN RGBD dataset [Song et al., 2015], which comes with 3D object bounding box annotations. Each
template represents a scene context by summarizing the bounding box locations and the category
information of the objects that are present in the training set. As an initial step, all the clean
examples of each scene template (i.e., sleeping area, lounging area, office area and a table and
chair sets) are identified. Next, a major object is manually identified in each scene template (e.g.,
a bed in the sleeping area) and its position is used to align all the scenes belonging to a specific
category. This rough alignment is used to find the most frequent locations for each object (also
called “anchor positions”) by performing k-means clustering and choosing the top k centers for
each object. Note that the object set includes not only the regular objects (e.g., bed, desk) but
also the scene elements which define the room layout (e.g., walls, floor and ceiling).
The clean dataset of the scene categories used to learn the scene templates, is also used
in subsequent processing stages, such as scene classification, scene alignment, and 3D object
detection. We will discuss these stages below.
Scene Classification (Module B, Fig. 7.10)
A global CNN is trained to classify an input image into one of the scene templates. Its architecture is exactly the same as the global scene pathway in Fig. 7.11. Note that the input to the network
is a 3D volumetric representation of the input RGB-D images, which is obtained using the
Truncated Signed Distance Function (TSDF) [Song and Xiao, 2016]. This first representation is

7.4. SCENE UNDERSTANDING

137

processed using three processing blocks, each consisting of a 3D convolution layer, a 3D pooling
layer, and a ReLU nonlinearity. The output is an intermediate “spatial feature” representation
corresponding to a grid of 3D spatial locations in the input. It is further processed by two fully
connected layers to obtain a “global feature” which is used to predict the scene category. Note
that both the local spatial feature and the global scene-level feature are later used in the 3D context
network.

Figure 7.11: Deep 3D Context network (Module D, Fig. 7.10) architecture. The network consists of two channels for global scene-level recognition and local object-level detection. The scene
pathway is supervised with the scene classification task during pre-training only (Module B). The
object pathway performs object detection, i.e., predicts the existence/non-existence of an object
and regresses its location. Note that the object pathway brings in both local and global features
from the scene pathway.

3D Transformation Network (Module C, Fig. 7.10)
Once the corresponding scene template category has been identified, a 3D transformation network estimates a global transformation which aligns the input scene to the corresponding scene template. The transformation is calculated in two steps: a rotation followed by a translation. Both
these transformation steps are implemented individually as classification problems, for which
CNNs are well suited.
For the rotation, only the rotation about the vertical-axis (yaw) is predicted since the
gravity direction is known for each scene in the SUN RGB-D dataset. Since an exact estimation of the rotation about the vertical axis is not required, the 360° angular range is divided into 36 regions, each encompassing 10°. A 3D CNN is trained to predict the angle of rotation about
the vertical axis. The CNN has the same architecture as the one used in Module B (scene classification), however, its output layer has 36 units which predicts one of the 36 regions denoting
the y-axis rotation.
Once the rotation has been applied, another 3D CNN is used to estimate the translation
that is required to align the major object (e.g., bed, desk) in the input scene and the identified
scene template. Again, the CNN has essentially the same architecture as Module B, however,
the last layer is replaced by a soft-max layer with 726 units. Each of the output units denotes a translation in a discretized space of 11 × 11 × 6 values. Similar to rotation, the estimated
translation is also a rough match due to the discretized set of values. Note that for such problems
(i.e., parameter estimation), regression is a natural choice since it avoids errors due to discretization. However, the authors could not successfully train a CNN with a regression loss for this
problem. Since the context network regresses the locations of each detected object in the scenes
in the next stage, a rough alignment suffices at this stage. We explain the context network below.
3D Context Network (Fig. 7.11)
The context neural network performs 3D object detection and layout estimation. A separate network is trained for each scene template category. This network has two main branches, as shown
in Fig. 7.11: a global scene level branch and a local object level branch. Both network pathways
encode different levels of details about the 3D scene input, which are complementary in nature.
The local object-level branch is dependent on the features from the initial and final layers of the
global scene-level branch. To avoid any optimization problems, the global scene-level branch
is initialized with the weights of the converged scene classification network (Module B) (since
both have the same architecture). Each category-specific context network is then trained separately using only the data from that specific scene template. During this training procedure, the
scene-level branch is fine-tuned while the object-level branch is trained from scratch.
The object-level branch operates on the spatial feature from the global scene-level branch.
The spatial feature is the output activation map after the initial set of three processing layers, each
consisting of a 3D convolution, pooling, and ReLU layers. This feature map is used to calculate object-level features (corresponding to anchor positions) at a 6 × 6 × 6 resolution using the
3D Region of Interest (RoI) pooling. The 3D RoI pooling is identical to its 2D counterpart
described in Section 4.2.7, with only an extra depth dimension. The pooled feature is then processed through 3D convolution and fully connected layers to predict the object existence and
its location (3D bounding box). The object location is regressed using the R-CNN localization loss to minimize the offset between the ground-truth and the predicted bounding boxes
(Section 7.2.1).
Hybrid Data for Pre-training
Due to the lack of a huge amount of RGB-D training data for scene understanding, this approach
uses an augmented dataset for training. In contrast to the simple data augmentation approaches
we discussed in Section 5.2.1, the proposed approach is more involved. Specifically, a hybrid
training set is generated by replacing the annotated objects from the SUN RGB-D dataset with
the same category CAD models. The resulting hybrid set is 1,000 times bigger than the original
RGBD training set. For the training of 3D context network (scene pathway), the models are
trained on this large hybrid dataset first, followed by a fine-tuning on the real RGB-D depth
maps to ensure that the training converges. For the alignment network, the pre-trained scene
pathway from the 3D context network is used for the initialization. Therefore, the alignment
network also benefits from the hybrid data.
The DeepContext model has been evaluated on the SUN RGB-D dataset and has been
shown to model scene contexts adequately.

7.4.2 LEARNING RICH FEATURES FROM RGB-D IMAGES
The object- and scene-level reasoning system presented in the previous section was trained end-to-end on 3D images. Here, we present an RGB-D-based (2.5D instead of 3D) approach [Gupta et al., 2014] which performs object detection, object instance segmentation, and semantic segmentation, as shown in Fig. 7.12. This approach is not end-to-end trainable; rather, it extends a CNN pre-trained on color images to depth images by introducing a new depth encoding method. This framework is interesting, since it demonstrates how pre-trained networks can be effectively used to transfer learning from domains where large quantities of data are available to the ones where labeled data is scarce, and even to new data modalities (e.g., depth in this case).
In the following sections, we briefly discuss the processing pipeline (summarized in Fig. 7.12).
Encoding Depth Images for Feature Learning
Instead of directly training a CNN on the depth images, Gupta et al. [2014] propose to encode
the depth information using three geometric features calculated at each pixel location. These
include: the horizontal disparity, the height above the ground, and the angle between the surface
normals at a given pixel and the estimated gravity direction. This encoding is termed as the HHA
encoding (first letter of each geometric feature). For the training dataset, all these channels are
mapped to a constant range of 0–255.
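A small NumPy sketch of this last step, i.e., stacking the three geometric channels and linearly mapping each to the 0–255 range, is shown below; the exact mapping used by Gupta et al. [2014] may differ, and the per-channel min-max scaling here is an assumption.

import numpy as np

def to_hha_image(disparity, height, angle):
    """Stack the three geometric channels (horizontal disparity, height above
    ground, angle to the gravity direction) and linearly map each channel to
    the 0-255 range; the three input maps are assumed to be precomputed."""
    channels = []
    for ch in (disparity, height, angle):
        lo, hi = ch.min(), ch.max()
        scaled = (ch - lo) / (hi - lo + 1e-8) * 255.0   # per-channel linear mapping
        channels.append(scaled.astype(np.uint8))
    return np.stack(channels, axis=-1)                  # H x W x 3, CNN-ready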
The HHA encoding and color images are then fed to the deep network to learn more
discriminative feature representations on top of the geometric features based on raw depth information. Since the NYU-Depth dataset used in this work consisted of only 400 images, an
explicit encoding of the geometric properties of the data made a big difference to the network,
which usually requires much larger amounts of data to automatically learn the best feature representations. Furthermore, the training dataset was extended by rendering synthetic CAD models
in the NYU-Depth dataset scenes. This is consistent with the data augmentation approach that
we discussed for the case of DeepContext [Zhang et al., 2016].

Figure 7.12: The input to this framework is an RGB and a depth image. First, object proposals
are generated using the contour information. The color and the encoded depth image are passed
through separately trained CNNs to obtain features which are then classified into object categories using the SVM classifiers. After detection, a Random Forest classifier is used to identify
foreground object segmentation within each valid detection.
CNN Fine-tuning for Feature Learning
Since the main goal of this work is object detection, it is useful to work on the object proposals. The region proposals are obtained using an improved version of Multiscale Combinatorial
Grouping (MCG) [Arbeláez et al., 2014] approach which incorporated additional geometric
features based on the depth information. The region proposals which have a high overlap with
the ground-truth object bounding box are first used to train a deep CNN model for classification. Similar to R-CNN, the pre-trained AlexNet [Krizhevsky et al., 2012] is fine-tuned for
this object classification task. Once the network is fine-tuned, object specific linear classifiers
(SVMs) are trained using the intermediate CNN features for the object detection task.
Instance Segmentation
Once the object detections are available, the pixels belonging to each object instance are labeled.
This problem is tackled by predicting the foreground or background label for each pixel within
a valid detection. For this purpose, a random forest classifier is used to provide the pixel level
labelings. This classifier is trained on the local hand-crafted features whose details are available
in Gupta et al. [2013]. Since these predictions are calculated roughly for individual pixels, they
can be noisy. To smooth the initial predictions of the random forest classifier, these predictions
are averaged on each super-pixel. Note that subsequent works on instance segmentation (e.g.,
He et al. [2017]) have incorporated a similar pipeline (i.e., first detecting object bounding boxes
and then predicting the foreground mask to label the individual object instance). However, in contrast to Gupta et al. [2014], He et al. [2017] use an end-to-end trainable CNN model which avoids manual parameter selection and a series of isolated processing steps toward the final goal. As a result, He et al. [2017] achieve highly accurate segmentations compared to non-end-to-end trainable approaches.
Another limitation of this approach is the separate processing of both color and depth
images. As we discussed in Chapter 6, AlexNet has a huge number of parameters, and learning
two separate sets of parameters for both modalities doubles the parameter space. Moreover, since
both images belong to the same scene, we expect to learn better cross-modality relationships if
both modalities are considered jointly. One approach to learn a shared set of parameters in such
cases is to stack the two modalities in the form of a multichannel input (e.g., six channels) and
perform a joint training over both modalities [Khan et al., 2017c, Zagoruyko and Komodakis,
2015].

7.4.3 POINTNET FOR SCENE UNDERSTANDING
PointNet (discussed in Section 7.1.1) has also been used for scene understanding by assigning
a semantically meaningful category label to each pixel in an image (Module B in Figure 7.1).
While we have discussed the details of PointNet before, it is interesting to note its similarity
with the Context Network in DeepContext (Section 7.4.1 and Figure 7.11). Both these networks
learn an initial representation, shared among the global (scene classification) and the local (semantic segmentation or object detection) tasks. Afterward, the global and local branches split
up, and the scene context is added in the local branch by copying the high-level features from
the global branch to the local one. One can notice that the incorporation of both the global
and local contexts are essential for a successful semantic labeling scheme. Other recent works in
scene segmentation are also built on similar ideas, i.e., better integration of scene context using
for example a pyramid-based feature description [Zhao et al., 2017], dilated convolutions [Yu
and Koltun, 2015], or a CRF model [Khan et al., 2016a, Zheng et al., 2015].

7.5 IMAGE GENERATION

Recent advances in image modeling with neural networks, such as Generative Adversarial Networks (GANs) [Goodfellow et al., 2014], have made it feasible to generate photo-realistic images, which can capture the high-level structure of the natural training data [van den Oord et al.,
2016]. GANs are one type of generative network, which can learn to produce realistic-looking
images in an unsupervised manner. In recent years, a number of GAN-based image generation
methods have emerged which work quite well. One such promising approach is Deep Convolutional Generative Adversarial Networks (DCGANs) [Radford et al., 2015], which generates
photo-realistic images by passing random noise through a deep convolutional network. Another
interesting approach is Super-Resolution Generative Adversarial Networks (SRGAN) [Ledig
et al., 2016], which generates high-resolution images from low-resolution counterparts. In this
section, we shall first briefly discuss GANs and then extend our discussion to DCGANs and
SRGAN.

7.5.1 GENERATIVE ADVERSARIAL NETWORKS (GANS)
GANs were first introduced by Goodfellow et al. [2014]. The main idea behind a GAN is to have
two competing neural network models (shown in Fig. 7.13). The first model is called the generator, which takes noise as input and generates samples. The other neural network, called the discriminator,
receives samples from both the generator (i.e., fake data) and the training data (i.e., real data),
and discriminates between the two sources. These two networks undergo a continuous learning
process, where the generator learns to produce more realistic samples, and the discriminator
learns to get better at distinguishing generated data from real data. These two networks are
trained simultaneously with the aim that this training will drive the generated samples to be
indistinguishable from real data. One of the advantages of GANs is that they can back-propagate
the gradient information from the discriminator back to the generator network. The generator,
therefore, knows how to adapt its parameters in order to produce output data that can fool the
discriminator.


Figure 7.13: Generative Adversarial Networks overview. The generator takes noise as an input
and generates samples. The discriminator distinguishes between the generator samples and the
training data.

GANs Training
The training of GANs involves the computation of two loss functions, one for the generator and
one for the discriminator. The loss function for the generator ensures that it produces better data
samples, while the loss function for the discriminator ensures that it distinguishes between the
generated and real samples. We now briefly discuss these loss functions. For more details, the
readers are referred to Goodfellow [2016].

Discriminator Loss Function
The discriminator's loss function J^(D) is represented by:
\[
J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\frac{1}{2}\,\mathbb{E}_{x \sim p_{data}}[\log D(x)] - \frac{1}{2}\,\mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))], \tag{7.5}
\]
which is the cross-entropy loss function. In this equation, θ^(D) and θ^(G) are the parameters of the discriminator and the generator networks, respectively. p_data denotes the distribution of the real data, x is a sample from p_data, p(z) is the input noise distribution of the generator, z is a sample from p(z), G(z) is the generator network, and D is the discriminator network. One can note from Eq. (7.5) that the discriminator is trained as a binary classifier (with a sigmoid output) on two mini-batches of data. One of them is from the dataset containing real data samples, with label 1 assigned to all examples, and the other is from the generator (i.e., fake data), with label 0 for all the examples.
Cross-entropy: The cross-entropy loss function for the binary classification
task is defined as follows:
\[
H((x_1, y_1), D) = -y_1 \log D(x_1) - (1 - y_1) \log(1 - D(x_1)), \tag{7.6}
\]
where x_1 and y_1 ∈ [0, 1] denote a sample from a probability distribution function D and its desired output, respectively. After summing over m data samples, Eq. (7.6) can be written as follows:
\[
H\big((x_i, y_i)_{i=1}^{m}, D\big) = -\sum_{i=1}^{m} y_i \log D(x_i) - \sum_{i=1}^{m} (1 - y_i) \log(1 - D(x_i)). \tag{7.7}
\]
In the case of GANs, data samples come from two sources: the real data distribution, x_i ∼ p_data, or the generator's distribution, x_i = G(z), where z ∼ p(z). Let's assume that the number of samples from both distributions is equal. By writing Eq. (7.7) probabilistically, i.e., replacing the sums with expectations, the label y_i with 1/2 (because the number of samples from the generator and the data distributions is equal), and log(1 − D(x_i)) with log(1 − D(G(z))), we get the same loss function as Eq. (7.5) for the discriminator.

Generator Loss Function
The discriminator, discussed above, distinguishes between the two classes, i.e., the real and the
fake data, and therefore needs the cross entropy function, which is the best option for such tasks.
However, in the case of the generator, the following three types of loss functions can be used.

• The minimax loss function: The minimax loss¹⁰ is the simplest version of the loss function, which is represented as follows:
\[
J^{(G)} = -J^{(D)} = \frac{1}{2}\,\mathbb{E}_{x \sim p_{data}}[\log D(x)] + \frac{1}{2}\,\mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]. \tag{7.8}
\]
The minimax version has been found to be less useful due to the gradient saturation problem. The latter occurs due to the poor design of the generator's loss function. Specifically, as shown in Fig. 7.14, when the generator samples are successfully rejected by the discriminator with high confidence¹¹ (i.e., D(G(z)) is close to zero), the generator's gradient vanishes and thus the generator network cannot learn anything.
• Heuristic, non-saturating loss function: The heuristic version is represented as follows:
\[
J^{(G)} = -\frac{1}{2}\,\mathbb{E}_{z}[\log D(G(z))]. \tag{7.9}
\]
This version of the loss function is based on the concept that the gradient of the generator is only dependent on the second term in Eq. (7.5). Therefore, as opposed to the minimax function, where the sign of J^(D) is flipped, in this case the target is changed, i.e., log D(G(z)) is used instead of log(1 − D(G(z))). The advantage of this strategy is that the generator gets a strong gradient signal at the start of the training process (as shown in Fig. 7.14), which helps it in attaining a fast improvement to generate better data, e.g., images.
• Maximum Likelihood loss function: As the name indicates, this version of the loss function is motivated by the concept of maximum likelihood (a well-known approach in machine learning) and can be written as follows:
\[
J^{(G)} = -\frac{1}{2}\,\mathbb{E}_{z}\left[\exp\left(\sigma^{-1}(D(G(z)))\right)\right], \tag{7.10}
\]
where σ is the logistic sigmoid function. Like the minimax loss function, the maximum likelihood loss also suffers from the gradient vanishing problem when D(G(z)) is close to zero, as shown in Fig. 7.14. Moreover, unlike the minimax and the heuristic loss functions, the maximum likelihood loss, as a function of D(G(z)), has a very high variance, which is problematic. This is because most of the gradient comes from a very small number of the generator's samples that are most likely to be real rather than fake.
¹⁰In the minimax game, there are two players (e.g., the generator and the discriminator) and in all states the reward of player 1 is the negative of the reward of player 2. Specifically, the discriminator minimizes a cross-entropy, but the generator maximizes the same cross-entropy.
¹¹At the start of the training, the generator likely produces random samples, which are quite different from the real samples, so the discriminator can easily classify the real and fake samples.


[Plot: J^{(G)} as a function of D(G(z)) for the minimax, non-saturating heuristic, and maximum likelihood costs.]

Figure 7.14: The loss response curves as functions of D(G(z)) for three different variants of the GAN generator's loss functions.
To summarize, one can note that none of the three generator loss functions depends on the real data (x in Eq. (7.5)). This is advantageous, because the generator cannot copy the input data x, which helps in avoiding over-fitting in the generator. With this brief overview of generative adversarial models and their loss functions, we will now summarize the different steps that are involved in GAN training (a minimal training-loop sketch in code is given after the list):
1. sampling a mini-batch of m samples from the real dataset p_data;
2. sampling a mini-batch of m noise samples z ∼ p(z) and passing them through the generator (i.e., fake samples);
3. learning the discriminator by minimizing its loss function in Eq. (7.5);
4. sampling another mini-batch of m noise samples and passing them through the generator (i.e., fake samples); and
5. learning the generator by minimizing its loss function in Eq. (7.9).
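The following is a minimal PyTorch sketch of steps 1–5, under the assumption that the generator G, the discriminator D (outputting a probability of shape m × 1), a data_loader over the real dataset, and the noise dimension z_dim are defined elsewhere; the heuristic loss of Eq. (7.9) is used for the generator, and the hyper-parameter values are illustrative.

    import torch
    import torch.nn.functional as F

    def train_gan(G, D, data_loader, z_dim, epochs=10, lr=2e-4, device="cpu"):
        # Alternating GAN updates following steps 1-5 above.
        opt_d = torch.optim.Adam(D.parameters(), lr=lr)
        opt_g = torch.optim.Adam(G.parameters(), lr=lr)
        for _ in range(epochs):
            for real, _ in data_loader:                      # step 1: real mini-batch
                real = real.to(device)
                m = real.size(0)
                z = torch.randn(m, z_dim, device=device)     # step 2: fake mini-batch
                fake = G(z)
                ones = torch.ones(m, 1, device=device)
                zeros = torch.zeros(m, 1, device=device)
                # Step 3: update the discriminator by minimizing its cross-entropy loss.
                d_loss = F.binary_cross_entropy(D(real), ones) + \
                         F.binary_cross_entropy(D(fake.detach()), zeros)
                opt_d.zero_grad(); d_loss.backward(); opt_d.step()
                # Steps 4-5: a fresh fake mini-batch, then update the generator
                # with the heuristic (non-saturating) loss of Eq. (7.9).
                z = torch.randn(m, z_dim, device=device)
                g_loss = F.binary_cross_entropy(D(G(z)), ones)
                opt_g.zero_grad(); g_loss.backward(); opt_g.step()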
These steps are repeated until convergence, or until a fixed number of iterations is reached. With this brief overview of generative adversarial networks, we will now discuss two representative applications of GANs, known as DCGAN [Radford et al., 2015] and SRGAN [Ledig et al., 2016].

7.5.2 DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS (DCGANS)
DCGANs [Radford et al., 2015] were among the first GAN models able to generate realistic images from random input samples. In the following, we shall discuss the architecture and the training of DCGANs.


Architecture of DCGANs
DCGANs provide a family of CNN architectures, which can be used in the generator and the
discriminator parts of GANs. The overall architecture is the same as the baseline GAN architecture, which is shown in Fig. 7.13. However, the architecture of both the generator (shown
in Fig. 7.15) and the discriminator (shown in Fig. 7.16) is borrowed from the all-convolutional
network [Springenberg, 2015], meaning that these models do not contain any pooling or unpooling layers. In addition, the generator uses transposed convolution to increase the representation’s spatial size, as illustrated in Fig. 7.15. Unlike the generator, the discriminator uses
convolution to squeeze the representation’s spatial size for the classification task (Fig. 7.16). For
both networks, a stride greater than 1 (usually 2) is used in the convolutional and transposed
convolutional layers. Batch normalization is used in all layers of both the discriminator and the
generator models of DCGANs with the exception of the first layer of the discriminator and
the last layer of the generator. This is done to ensure that DCGAN learns the correct scale and mean of the data distribution. DCGANs use the ReLU activation in all transposed convolutional layers except the output layer, which uses the tanh activation function. This is because a bounded activation function, such as tanh, allows the generator to speed up learning and to cover the color space of the samples from the training distribution. For the discriminator
model, leaky ReLU (discussed in Section 4.2.4) was found to be better than ReLU. DCGANs
use the Adam optimizer (discussed in Section 5.4.6) rather than SGD with momentum.
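As an illustration of the generator just described, the following PyTorch sketch builds a DCGAN-style generator with the layer widths of Fig. 7.15; the kernel size and padding values are assumptions chosen so that each transposed convolution doubles the spatial size, not details taken from the original implementation.

    import torch
    import torch.nn as nn

    def dcgan_generator(z_dim=100):
        # 100-D noise -> 4x4x1024 -> four stride-2 transposed convolutions -> 64x64x3 image.
        # Batch normalization in every layer except the output; ReLU everywhere except
        # the tanh output, as described in the text.
        return nn.Sequential(
            nn.ConvTranspose2d(z_dim, 1024, kernel_size=4, stride=1, padding=0),  # "project and reshape"
            nn.BatchNorm2d(1024), nn.ReLU(True),
            nn.ConvTranspose2d(1024, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.ReLU(True),  # 8x8
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),   # 16x16
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),   # 32x32
            nn.ConvTranspose2d(128, 3, 4, 2, 1), nn.Tanh(),                              # 64x64x3
        )

    # Usage: a batch of 16 noise vectors, reshaped to (16, 100, 1, 1), yields 16 images.
    G = dcgan_generator()
    z = torch.randn(16, 100, 1, 1)
    print(G(z).shape)   # torch.Size([16, 3, 64, 64])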
[Diagram: a 100-D input z is projected and reshaped to a 4 × 4 × 1,024 tensor and then passed through four stride-2 transposed convolutional layers (tconv1–tconv4 with 512, 256, 128, and 3 feature maps) to produce a 64 × 64 × 3 image.]
Figure 7.15: Example DCGAN generator's architecture for the Large-scale Scene Understanding (LSUN) dataset [Song and Xiao, 2015]. A 100-dimensional uniform sample z is projected and then reshaped to a 4 × 4 × 1,024 tensor, where 1,024 is the number of feature maps. Next, the tensor is passed through a sequence of transposed convolutional layers (tconv1–tconv4) to generate a 64 × 64 × 3 realistic image.


[Diagram: a 64 × 64 × 3 input image is passed through four stride-2 convolutional layers (conv1–conv4 with 64, 128, 256, and 512 feature maps), reshaped to a vector, and fed to a fully connected layer FC(2) whose output drives the loss function.]

Figure 7.16: Example DCGAN discriminator's architecture for the LSUN dataset [Song and Xiao, 2015]. A 64 × 64 RGB input image is passed through a sequence of convolutional layers (conv1–conv4) followed by a fully connected layer with 2 outputs.

DCGANs as a Feature Extractor
DCGAN uses ImageNet as a dataset of natural images for unsupervised training, in order to evaluate the quality of the features it learns. For this purpose, DCGAN is first trained on the ImageNet dataset. Note that the image labels are not required during training (unsupervised learning). Also, it is important to note that all the training and test images are scaled to the range [-1, 1], i.e., the range of the tanh activation function. No other pre-processing is done. Then, for each training image in a supervised dataset (e.g., the CIFAR-10 dataset), the output feature maps from all the convolutional layers of the discriminator are max-pooled to produce a 4 × 4 spatial grid for each layer. These spatial grids are then flattened and concatenated to form a high-dimensional feature representation of the image. Finally, an ℓ2-regularized linear SVM is trained on the high-dimensional feature representations of all the training images in the supervised dataset. It is interesting to note that although DCGAN is not trained on the supervised dataset, it outperforms all K-means-based methods. DCGANs can also be used for image generation. For example, experiments show that the output of the last layer in the generator network of a DCGAN trained on the LSUN dataset [Song and Xiao, 2015] consists of realistic bedroom images.
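A rough sketch of this evaluation protocol is shown below, assuming a hypothetical features(image) function that returns the flattened, concatenated 4 × 4 max-pooled discriminator activations for one image; the scikit-learn classifier and the regularization constant are illustrative choices.

    import numpy as np
    from sklearn.svm import LinearSVC

    def evaluate_dcgan_features(features, train_imgs, train_labels, test_imgs, test_labels):
        # Build the high-dimensional feature representation for every image and
        # train an L2-regularized linear SVM on top of it.
        X_train = np.stack([features(img) for img in train_imgs])
        X_test = np.stack([features(img) for img in test_imgs])
        clf = LinearSVC(C=1.0)            # penalty='l2' is the scikit-learn default
        clf.fit(X_train, train_labels)
        return clf.score(X_test, test_labels)   # classification accuracy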

7.5.3 SUPER RESOLUTION GENERATIVE ADVERSARIAL NETWORK (SRGAN)
In Ledig et al. [2016], a generative adversarial network for single image super-resolution (SRGAN) is presented, where the input to the network is a low-resolution (LR) image and the output is its high-resolution (HR) counterpart. Unlike existing optimization-based super-resolution techniques, which rely on the minimization of the Mean Squared Error (MSE) as the loss function, this technique proposes a perceptual loss function for the generator. The latter is comprised of two losses, which are


called “content loss” and “adversarial loss.” In the following, we will briefly discuss these loss
functions, followed by the architecture of SRGAN.
Content Loss
The MSE loss function smooths images by suppressing their high-frequency content, which results in perceptually unsatisfying solutions [Ledig et al., 2016]. To overcome this problem, a content loss function that is motivated by perceptual similarity is used in SRGAN:
l^{SR}_{Con} = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}\big(\phi(I^{HR})_{x,y} - \phi(G_G(I^{LR}))_{x,y}\big)^2, \qquad (7.11)

where I^{LR} and I^{HR} are the LR and HR images, φ(·) is the output feature map produced by a convolutional layer of VGGnet-19, and W and H are the width and the height of that feature map, respectively. In summary, Eq. (7.11) computes the Euclidean distance between the output feature maps (i.e., the output of a convolutional layer of the pre-trained VGGnet-19) of the generated image G_G(I^{LR}) and the real high-resolution image I^{HR}. Note that the pre-trained VGGnet-19 is only used as a feature extractor (i.e., its weight parameters are not changed during the training of SRGAN).
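A sketch of Eq. (7.11) in PyTorch is given below; it uses a frozen, pre-trained VGG-19 from torchvision as the feature extractor φ(·). The particular layer cut-off (the first 36 modules of the feature extractor) and the older pretrained=True argument are assumptions made for illustration, not the exact configuration used by SRGAN.

    import torch
    import torchvision

    # A frozen VGG-19 sub-network plays the role of phi(.) in Eq. (7.11).
    vgg = torchvision.models.vgg19(pretrained=True).features[:36].eval()
    for p in vgg.parameters():
        p.requires_grad = False        # the feature extractor is not trained

    def content_loss(sr_image, hr_image):
        # Mean squared distance between the VGG feature maps of the generated
        # (super-resolved) image and the real high-resolution image.
        return torch.mean((vgg(hr_image) - vgg(sr_image)) ** 2)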
Adversarial Loss
The adversarial loss is the same as the heuristic loss function in the baseline GAN (Eq. (7.9)),
and is defined as follows:
l^{SR}_{Adv} = -\log D_D(G_G(I^{LR})), \qquad (7.12)

where D_D(G_G(I^{LR})) is the probability that the image produced by the generator, G_G(I^{LR}), is an HR image (i.e., a real image).
Perceptual Loss Function as the Generator Loss
The perceptual loss function, which is used as the generator loss in SRGAN, is calculated as the
weighted sum of the content loss and the adversarial loss, discussed above and is given by:
l^{SR} = l^{SR}_{Con} + 10^{-3}\, l^{SR}_{Adv}, \qquad (7.13)

where l^{SR}_{Con} represents the content loss, while l^{SR}_{Adv} is the adversarial loss.

Discriminator Loss
The discriminator is trained as a binary classifier (real HR image vs. generated image) with a sigmoid output, using the cross-entropy loss function (Eq. (7.5)).


SRGAN Architecture
Similar to the baseline GAN [Goodfellow et al., 2014], SRGAN has two key components:
the discriminator and the generator. We will now briefly discuss the architecture of these two
SRGAN components.
Discriminator Network of SRGAN
The discriminator network of SRGAN, shown in Fig. 7.17, is inspired by the architecture of
DCGAN (discussed in Section 7.5.2). The network consists of eight convolutional layers with 3 × 3 convolution kernels, followed by two fully connected layers and a sigmoid function to perform binary classification.

[Diagram: Input → Conv (k3n64s1) + Leaky ReLU → Conv + BN + Leaky ReLU blocks with configurations k3n64s2, k3n128s1, k3n128s2, k3n256s1, k3n256s2, k3n512s1, k3n512s2 → Dense(1024) + Leaky ReLU → Dense(1) → Sigmoid.]

Figure 7.17: Architecture of the discriminator network of SRGAN. k, n, and s represent the kernel size, number of feature maps, and stride for each convolutional layer.

Generator Network of SRGAN
The generator component of SRGAN, shown in Fig. 7.18, is inspired by the deep residual network (discussed in Section 6.6) and the architecture of DCGAN (discussed in Section 7.5.2). As shown in Fig. 7.18, a PReLU activation function is used in the generator's convolutional layers (while the discriminator uses leaky ReLU, as suggested by DCGAN [Radford et al., 2015]). Moreover, batch normalization is used after all convolutional layers, with the exception of the first convolutional layer.
[Diagram: the generator takes a low-resolution input image through a k9n64s1 convolution with PReLU, B residual blocks of k3n64s1 convolutions with batch normalization, PReLU, and element-wise sums, a skip connection, and k3n256s1 and k9n3s1 convolutions toward the output.]

Figure 7.18: Architecture of the generator network of SRGAN. Similar to the discriminator network, k, n, and s represent the kernel size, number of feature maps, and stride for each convolutional layer.
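To make the k3n64s1 residual blocks of Fig. 7.18 concrete, the following PyTorch sketch implements one such block (convolution, batch normalization, PReLU, a second convolution and batch normalization, and an element-wise sum with the block input); the padding choice and the class name are illustrative assumptions.

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # One k3n64s1 residual block of the SRGAN generator (Fig. 7.18).
        def __init__(self, channels=64):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
                nn.BatchNorm2d(channels),
                nn.PReLU(),
                nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
                nn.BatchNorm2d(channels),
            )

        def forward(self, x):
            return x + self.body(x)    # element-wise sum (local skip connection)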


In summary, SRGAN is able to estimate high-resolution images with photo-realistic textures from heavily down-sampled images. It achieved very good performance on three publicly available datasets, namely Set5, Set14, and BSD100.

7.6 VIDEO-BASED ACTION RECOGNITION

Human action recognition in videos is a challenging research problem, which has received a significant amount of attention in the computer vision community [Donahue et al., 2015, Karpathy
et al., 2014, Rahmani and Bennamoun, 2017, Rahmani and Mian, 2016, Rahmani et al., 2017,
Simonyan and Zisserman, 2014a]. Action recognition aims to enable computers to automatically
recognize human actions from real-world videos. Compared to single-image classification, the temporal extent of action videos provides additional information for action recognition. Inspired by this, several approaches have been proposed to extend state-of-the-art image classification CNNs (e.g., VGGnet, ResNet) to action recognition from video data. In this section, we shall briefly discuss three representative CNN-based architectures used for the video-based human action recognition task.

7.6.1 ACTION RECOGNITION FROM STILL VIDEO FRAMES
CNNs have so far achieved promising image recognition results. Inspired by this, an extensive
evaluation of CNNs for extending the connectivity of a CNN to the temporal domain for the
task of large-scale action recognition is provided in Karpathy et al. [2014]. We shall now discuss
different architectures for encoding temporal variations of action videos.
Single Frame Architecture
We discuss a single-frame baseline architecture, shown in Fig. 7.19a, to analyze the contribution of the static appearance to the classification accuracy. The single-frame model is similar to AlexNet [Krizhevsky et al., 2012], which won the ImageNet challenge. However, instead of accepting the original input of size 224 × 224 × 3, the network takes a 170 × 170 × 3 image. This network has the following configuration: Conv(96, 11, 3)–N–P–Conv(256, 5, 1)–N–P–Conv(384, 3, 1)–Conv(384, 3, 1)–Conv(256, 3, 1)–P–FC(4096)–FC(4096), where Conv(f, s, t) represents a convolutional layer with f filters of spatial size s × s and an input stride of t. A fully connected layer with n nodes is represented by FC(n). For the pooling layers P and all the normalization layers N, the architectural details described in Krizhevsky et al. [2012] are used, along with the following parameters: k = 2, n = 5, α = 10⁻⁴, β = 0.5, where the constants k, n, α, and β are hyper-parameters. A softmax layer with dense connections is attached to the last fully connected layer.
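The configuration above can be written compactly in PyTorch as follows; the 2 × 2 max pooling, the padding values, and the number of output classes are assumptions made for illustration, while the local response normalization parameters follow the values quoted in the text.

    import torch.nn as nn

    def single_frame_net(num_classes=487):
        # Conv(96,11,3)-N-P-Conv(256,5,1)-N-P-Conv(384,3,1)-Conv(384,3,1)-Conv(256,3,1)-P-FC(4096)-FC(4096)
        # for a 170x170x3 input frame; a softmax over num_classes follows the last FC layer.
        def lrn():
            return nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.5, k=2.0)
        return nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=3), nn.ReLU(True), lrn(), nn.MaxPool2d(2, 2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(True), lrn(), nn.MaxPool2d(2, 2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(True), nn.MaxPool2d(2, 2),
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(True),
            nn.Linear(4096, 4096), nn.ReLU(True),
            nn.Linear(4096, num_classes),
        )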
Given an entire action video, the video-level prediction is produced by forward propagating each frame individually through the network and then averaging the individual frame predictions over the duration of the video.


Figure 7.19: Approaches for fusing information over the temporal dimension through the network. (a) Single Frame, (b) Early Fusion, (c) Late Fusion, and (d) Slow Fusion. In the Slow
Fusion Model (d), the depicted columns share parameters. The pink, green, and blue boxes denote convolutional, normalization and pooling layers, respectively.
Early Fusion Architecture
We now discuss the Early Fusion model (Fig. 7.19b). This model captures the information of an entire time window and combines it across frames at the pixel level. To achieve this, the filters of the first Conv layer in the single-frame network (described above) are modified. The new filters have a size of 11 × 11 × 3T pixels, where T defines the temporal extent and is set to 10. This direct and early connectivity to the pixel data helps this model to accurately detect the speed and the direction of local motion.
For a given video, 20 randomly selected sample clips are individually passed through the network and their class predictions are then averaged to produce the video-level action class prediction.
Late Fusion Architecture
The late fusion model (Fig. 7.19c) consists of two separate single frame networks (described
above) with shared parameters up to the last Conv layer, Conv.256; 3; 1/. The outputs of the
last Conv layer of these two separate single frame networks are then merged in the first fully
connected layer. Global motion characteristics are computed by the first fully connected layer by
comparing the outputs of the two single-frame networks. These two networks operate on frames that are 15 frames apart. More precisely, the inputs to the first and the second single-frame networks are the i-th frame and the (i + 15)-th frame, respectively.


Slow Fusion Architecture
This model (Fig. 7.19d) slowly fuses the temporal information throughout the network, in such a way that the higher layers have access to more global information in both the temporal and the spatial domains. This is achieved by performing a temporal convolution along with the spatial convolution, and by extending the connectivity of all convolutional layers in time. More precisely, as shown in Fig. 7.19d, every filter in the first convolutional layer is applied on an input clip of 10 frames. The temporal extent of each filter is T = 4 and the stride is equal to 2. Thus, 4 responses are produced for each video clip. This process is iterated by the second and third layers with filters of temporal extent T = 2 and a stride equal to 2. Therefore, the information across all the input frames (a total of 10) can be accessed by the third convolutional layer.
Given an entire human action video, video-level classification is performed by passing 20 randomly selected sample clips through the network and then averaging the individual clip predictions over the duration of the video.
Multi-resolution Architecture
In order to speed up the above-mentioned models while retaining their accuracy, a multi-resolution architecture has been proposed in Karpathy et al. [2014]. The multi-resolution model consists of two separate networks (i.e., fovea and context networks) operating on two spatial resolutions, as shown in Fig. 7.20. The architecture of the fovea and context networks is similar to the single-frame architecture discussed above. However, instead of accepting the original input of size 170 × 170 × 3, these networks take 89 × 89 × 3 images. More precisely, the input to the fovea model is the center region of size 89 × 89 at the original spatial resolution, while the context stream uses frames down-sampled to half the original resolution. The total dimensionality of the inputs is therefore halved. Moreover, the last pooling layer is removed from both the fovea and context networks, and the activation outputs of both networks are concatenated and fed into the first fully connected layer.
Model Comparison
All models were trained on the Sports-1M dataset [Karpathy et al., 2014], which includes 200,000 test videos. The results showed that the variation among the different CNN architectures (e.g., Single Frame, Multi-resolution, Early, Late, and Slow Fusion) is surprisingly insignificant. Moreover, the results were significantly worse than those of state-of-the-art hand-crafted shallow models. One reason is that these models cannot capture the motion information in many cases. For example, the Slow Fusion model is expected to implicitly learn spatio-temporal features in its first layers, which is a difficult task [Simonyan and Zisserman, 2014a]. To resolve this issue, the two-stream CNN model [Simonyan and Zisserman, 2014a] was proposed to explicitly take into account both spatial and temporal information in a single end-to-end learning framework.


[Diagram: a fovea stream and a context stream, each consisting of convolutional layers with 96, 256, 384, 384, and 256 feature maps.]

Figure 7.20: Multi-resolution architecture. Input frames are passed through two separate
streams: a context stream which models the low-resolution image and a fovea stream which
processes the high-resolution center crop image. The pink, green, and blue boxes denote convolutional, normalization, and pooling layers, respectively. Both streams converge to two fully
connected layers (yellow boxes).

7.6.2 TWO-STREAM CNNS
The two-stream CNN model [Simonyan and Zisserman, 2014a] (shown in Fig. 7.21) uses two separate spatial and temporal CNNs, which are then combined by late fusion. The spatial network performs action recognition from single video frames, while the temporal network is trained to recognize actions from motion, i.e., dense optical flow. The idea behind this two-stream
model is related to the fact that the human visual cortex contains two pathways for object and
motion recognition, i.e., the ventral stream performs object recognition and the dorsal stream
recognizes motion.
Spatial Stream CNN
The spatial stream CNN model is similar to the single frame model in Section 7.6.1. Given
an action video, each video frame is individually passed through the spatial model shown in
Fig. 7.21 and an action label is assigned to each frame. Note that the label of all frames belonging
to a given action video is the same as the label of the action video.


[Diagram: a spatial stream CNN (input: a single frame; layers C1-N1-P1, C2-N2-P2, C3, C4, C5-P5, FC6, FC7, FC8) and a temporal stream CNN (input: multi-frame optical flow; layers C1-N1-P1, C2-P2, C3, C4, C5-P5, FC6, FC7, FC8), whose class scores are combined by fusion.]

Figure 7.21: The architecture of the two-stream CNNs for action classification.
Temporal Stream CNN
Unlike the motion-aware CNN models introduced in Section 7.6.1 (e.g., Slow Fusion), which use stacks of raw video frames as input, the temporal stream CNN takes stacked optical flow displacement fields between several consecutive frames as input, in order to explicitly learn temporal features. In the following, three variants of the optical flow-based input are explained.

• Optical flow stacking: The input of the temporal stream CNN is formed by stacking the dense optical flow of L consecutive frames (shown in Fig. 7.22). The optical flow at point (u, v) in frame t is a 2D displacement vector (i.e., a horizontal and a vertical displacement), which moves the point to the corresponding point in the next frame t + 1. Note that the horizontal and vertical components of the dense optical flow of a frame can be seen as image channels. Thus, the stacked dense optical flow of L consecutive frames forms an input image of 2L channels, which is fed to the temporal stream CNN (a short sketch of how this 2L-channel volume is assembled is given after this list).
• Trajectory stacking: Unlike the optical flow stacking method which samples the displacement vectors at the same location in L consecutive frames, the trajectory stacking method
represents motion in an input image of 2L channels by sampling L 2D points along the
motion trajectories [Wang et al., 2011a], as shown in Fig. 7.22.
• Bi-directional optical flow stacking: Both the optical flow and the trajectory stacking methods operate on the forward optical flow. The bi-directional optical flow stacking method extends these methods by computing both forward and backward optical flow fields. More precisely, motion information is encoded in an input image of 2L channels by stacking L/2 forward optical flows between frames t and t + L/2, and L/2 backward optical flows between frames t − L/2 and t.
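As promised above, here is a small NumPy sketch of how the 2L-channel input volume of the optical flow stacking variant can be assembled; the flow fields are assumed to be given as H × W × 2 arrays computed by some external optical flow method, and all names are illustrative.

    import numpy as np

    def stack_optical_flow(flows):
        # flows: a list of L dense flow fields, each of shape (H, W, 2) holding the
        # horizontal and vertical displacement channels for consecutive frame pairs.
        # Returns an (H, W, 2L) input volume for the temporal stream CNN.
        return np.concatenate(flows, axis=-1)

    # Toy usage with L = 10 random flow fields of size 224 x 224.
    flows = [np.random.randn(224, 224, 2).astype(np.float32) for _ in range(10)]
    print(stack_optical_flow(flows).shape)   # (224, 224, 20)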


Figure 7.22: Left: optical flow stacking method; right: trajectory stacking method.
Architecture
The architecture of the two-stream CNNs model is shown in Fig. 7.21. The architecture of
the spatial and temporal CNN models is similar, except that the second normalization layer is
removed from the temporal CNN to reduce memory consumption. As shown in this figure,
a class score fusion is added to the end of the model to combine the softmax scores of both
the spatial and temporal models by late fusion. Different approaches can be used for class score
fusion. However, experimental results showed that training a linear SVM classifier on stacked ℓ2-normalized softmax scores outperforms simple averaging.

7.6.3 LONG-TERM RECURRENT CONVOLUTIONAL NETWORK (LRCN)
Unlike the methods in Section 7.6.1 and Section 7.6.2 which learn CNN filters based on
a stack of a fixed number of input frames, Long-term Recurrent Convolutional Network
(LRCN) [Donahue et al., 2015] is not constrained to fixed-length inputs and can thus learn to recognize more complex action videos. As shown in Fig. 7.23, in LRCN, individual video frames are first passed through CNN models with shared parameters and then connected to a single-layer LSTM network (described in Chapter 3). More precisely, the LRCN model combines a deep hierarchical visual feature extractor (a CNN) with an LSTM that can learn to recognize temporal variations in an end-to-end fashion.
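The LRCN idea can be sketched in a few lines of PyTorch: a CNN with shared parameters produces per-frame features, which are fed to an LSTM whose per-time-step outputs give the class scores. The small CNN, the feature and hidden dimensions, and the number of classes below are illustrative placeholders, not the networks used in the original paper.

    import torch
    import torch.nn as nn

    class LRCNSketch(nn.Module):
        # Per-frame CNN features -> single-layer LSTM -> per-time-step class scores.
        def __init__(self, feat_dim=256, hidden=256, num_classes=101):
            super().__init__()
            self.cnn = nn.Sequential(                       # a small illustrative CNN
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(True),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(True),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
            )
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.fc = nn.Linear(hidden, num_classes)

        def forward(self, video):                           # video: (batch, T, 3, H, W)
            b, t = video.shape[:2]
            feats = self.cnn(video.flatten(0, 1)).view(b, t, -1)   # shared CNN over frames
            out, _ = self.lstm(feats)
            return self.fc(out)                             # (batch, T, num_classes)

    model = LRCNSketch()
    scores = model(torch.randn(2, 16, 3, 64, 64))           # 2 videos of 16 frames
    print(scores.shape)                                     # torch.Size([2, 16, 101])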
In this chapter, we discussed representative works that use CNNs in computer vision. In Table 7.1, we provide an overview of some other important CNN applications and the most representative recent works, which have not been covered in detail in this book. In the following chapter, we will discuss some prominent CNN tools and libraries.


Table 7.1: Few recent/most representative applications of CNN, not discussed in this book (Continues.)

Image captioning:
• Deep Visual-Semantic Alignments for Generating Image Descriptions [Karpathy and Fei-Fei, 2015]
• DenseCap: Fully Convolutional Localization Networks for Dense Captioning [Johnson et al., 2016]

3D reconstruction from a 2D image:
• Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression [Jackson et al., 2017]
• Semantic Scene Completion from a Single Depth Image [Song et al., 2017]

Contour/edge detection:
• DeepContour: A Deep Convolutional Feature Learned by Positive Sharing Loss for Contour Detection [Shen et al., 2015b]
• Edge Detection Using Convolutional Neural Network [Wang, 2016]

Text detection and recognition:
• Reading Text in the Wild with Convolutional Neural Networks [Jaderberg et al., 2016]
• End-to-End Text Recognition with Convolutional Neural Networks [Wang et al., 2012]

Octree representation for shape analysis:
• O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis [Wang et al., 2017]
• OctNet: Learning Deep 3D Representations at High Resolutions [Riegler et al., 2016]

Face recognition:
• DeepFace: Closing the Gap to Human-Level Performance in Face Verification [Taigman et al., 2014]
• FaceNet: A Unified Embedding for Face Recognition and Clustering [Schroff et al., 2015]

Depth estimation:
• Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields [Liu et al., 2016]
• Deep Convolutional Neural Fields for Depth Estimation from a Single Image [Liu et al., 2015]

Pose estimation:
• PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization [Kendall et al., 2015]
• DeepPose: Human Pose Estimation via Deep Neural Networks [Toshev and Szegedy, 2014]


Table 7.1: (Continued.) Few recent/most representative applications of CNN, not discussed in this book

Tracking:
• Hedged Deep Tracking [Qi et al., 2016b]
• Hierarchical Convolutional Features for Visual Tracking [Ma et al., 2015]

Shadow detection:
• Automatic Shadow Detection and Removal from a Single Image [Khan et al., 2016a]
• Shadow Optimization from Structured Deep Edge Detection [Shen et al., 2015a]

Video summarization:
• Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization [Yao et al., 2016]
• Large-Scale Video Summarization Using Web-Image Priors [Khosla et al., 2013]

Visual question answering:
• Multi-level Attention Networks for Visual Question Answering [Yu et al., 2017]
• Image Captioning and Visual Question Answering Based on Attributes and External Knowledge [Wu et al., 2017]

Event detection:
• DevNet: A Deep Event Network for Multimedia Event Detection and Evidence Recounting [Gan et al., 2015]
• A Discriminative CNN Video Representation for Event Detection [Xu et al., 2015]

Image retrieval:
• Collaborative Index Embedding for Image Retrieval [Zhou et al., 2017]
• Deep Semantic Ranking Based Hashing for Multi-Label Image Retrieval [Zhao et al., 2015a]

Person re-identification:
• Recurrent Convolutional Network for Video-Based Person Re-identification [McLaughlin et al., 2016]
• An Improved Deep Learning Architecture for Person Re-Identification [Ahmed et al., 2015]

Change detection:
• Forest Change Detection in Incomplete Satellite Images with Deep Neural Networks [Khan et al., 2017c]
• Detecting Change for Multi-View, Long-Term Surface Inspection [Stent et al., 2015]


[Diagram: visual input → CNN (shared across time steps) → visual features → LSTM sequence learning → predictions W1, W2, ..., WT.]

Figure 7.23: The architecture of the LRCN.

CHAPTER 8

Deep Learning Tools and Libraries
There has been a lot of interest from academia (e.g., the University of California Berkeley, New York University, the University of Toronto, the University of Montreal) and industry groups (e.g., Google, Facebook, Microsoft) in developing deep learning frameworks. This is mainly due to their popularity in many application domains over the last few years. The key motivation for developing these libraries is to provide an efficient and friendly development environment for researchers to design and implement deep neural networks. Some of the widely used deep learning frameworks are Caffe, TensorFlow, MatConvNet, Torch7, Theano, Keras, Lasagne, Marvin, Chainer, DeepLearning4J, and MXNet.1 Many of these libraries are well supported, with dozens of active contributors and large user bases. Because of strong CUDA backends, many of these frameworks are very fast at training deep networks with billions of parameters. Based on the number of users in the Google groups and the number of contributors to each framework's GitHub repository, we selected ten widely developed and supported deep learning frameworks, namely Caffe, TensorFlow, MatConvNet, Torch7, Theano, Keras, Lasagne, Marvin, Chainer, and PyTorch, for further discussion.

8.1 CAFFE

Caffe is a fully open-source deep learning framework and was perhaps the first industry-grade deep learning framework, due to its excellent CNN implementation at the time. It was developed by the Berkeley Vision and Learning Center (BVLC), as well as community contributors. It is highly popular within the computer vision community. The code is written in C++, with CUDA used for GPU computation, and has Python, MATLAB, and command-line interfaces for training and deployment purposes. Caffe stores and communicates data using blobs, which are 4-dimensional arrays. It provides a complete set of layer types, including convolution, pooling, inner products, nonlinearities such as rectified linear and logistic, local response normalization, element-wise operations, and different types of losses, such as softmax and hinge. The learned

1 For a more complete list of deep learning frameworks, see http://deeplearning.net/software_links/.


models can be saved to disk as Google Protocol Buffers,2 which have many advantages over XML for serializing structured data, such as minimal-size binary strings when serialized, efficient serialization, ease of programmatic use, and a human-readable text format that is compatible with the binary version. Large-scale data is stored in LevelDB databases. There are also pre-trained models for the state-of-the-art networks, which allows reproducible research. We refer the readers to the official website3 to learn more about the Caffe framework.
However, Caffe's support for recurrent networks and language modeling is poor in general. Moreover, it can only be used for image-based applications, and not for other deep learning applications such as text or speech. Another drawback is that the user needs to manually define the gradient formulae for back-propagation. As we will discuss further, more recent libraries offer automatic gradient computation, which makes it easy to define new layers and modules.
Following in the footsteps of Caffe, Facebook also developed Caffe2, a new lightweight, modular deep learning framework which builds on the original Caffe and improves upon it, particularly through a modern computation-graph design, support for large-scale distributed training, the flexibility to port to multiple platforms with ease, and minimalist modularity.

8.2 TENSORFLOW

TensorFlow was originally developed by the Google Brain team. TensorFlow is written with a Python API over a C/C++ engine for numerical computation using data flow graphs. Multiple APIs are provided. Complete programming control is provided by the lowest-level API, called TensorFlow Core. Machine learning researchers and others who need fine levels of control over their models are recommended to use TensorFlow Core. The higher-level APIs are built on top of TensorFlow Core and are easier to learn and use than TensorFlow Core.
TensorFlow offers automatic differentiation capabilities, which simplify the process of defining new operations in the network. It uses data flow graphs to perform numerical computations; the graph nodes represent mathematical operations and the edges represent tensors. TensorFlow supports multiple backends: CPU or GPU on desktop, server, or mobile platforms. It has well-supported bindings to Python and C++. TensorFlow also has tools to support reinforcement learning. For more details, we suggest that readers visit the TensorFlow website.4
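As a small illustration of the automatic differentiation mentioned above, the snippet below uses the GradientTape API of recent TensorFlow releases (TF 2.x); older graph-based versions express the same idea differently.

    import tensorflow as tf

    # Differentiate y = x^2 + 3x at x = 2 automatically.
    x = tf.Variable(2.0)
    with tf.GradientTape() as tape:
        y = x * x + 3.0 * x
    print(tape.gradient(y, x).numpy())   # 7.0, i.e., dy/dx = 2x + 3 at x = 2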

8.3 MATCONVNET

This is a MATLAB toolbox for the implementation of convolutional neural networks. It was developed by the Oxford Visual Geometry Group as an educational and research platform.

2 Google Protocol Buffers are a way of efficiently encoding structured data. They are smaller, faster, and simpler than XML, and are useful for developing programs that communicate with each other over a wire, or for storing data. First, a user-defined data structure is built; then, specially generated source code is used to easily write and read the structured data to and from a variety of data streams, using a variety of languages.
3 http://caffe.berkeleyvision.org/
4 https://www.tensorflow.org/


Unlike most existing deep network frameworks, which hide the neural network layers behind a wall of compiled code, MatConvNet layers can be implemented directly in MATLAB, which is one of the most popular development environments in computer vision research and in many other areas. Thus, layers can easily be modified, extended, or integrated with new ones. However, many of its CNN building blocks, such as convolution, normalization, and pooling, use optimized CPU and GPU implementations written in C++ and CUDA. The implementation of the CNN computations in this library is inspired by the Caffe framework.
Unlike most existing deep learning frameworks, MatConvNet is simple to compile and install. The implementation is fully self-contained, requiring only MATLAB and a compatible C++ compiler. However, MatConvNet does not support recurrent networks, and it provides only a small number of state-of-the-art pre-trained models. In order to learn more about this framework, we refer the readers to the official website.5

8.4 TORCH7

Torch7 is a scientific computing framework which provides a wide support for machine learning
algorithms, especially deep neural networks. It provides a MATLAB-like environment and has
strong CUDA and CPU backends. Torch7 is built on Lua, which runs on the LuaJIT compiler. The Lua scripting language was chosen because it provides three main advantages: (1) Lua makes the development of numerical algorithms easy; (2) Lua can easily be embedded in a C application and provides a great C API; and (3) Lua is one of the fastest interpreted languages (with one of the fastest Just-In-Time ( JIT) compilers). Lua is implemented as a library, written in C.
Torch7 relies on its Tensor class to provide an efficient multi-dimensional array type.
Torch7 has C, C++, and Lua interfaces for model learning and deployment purposes. It also
has an easy to use multi-GPU support which makes it powerful for the learning of deep models.
Torch7 has a large community of developers and is actively used within large organizations such as New York University, the Facebook AI lab, Google DeepMind, and Twitter. Unlike most existing frameworks, Torch7 offers a rich set of RNNs. However, unlike TensorFlow, Torch7 provides GPU and automatic differentiation support in two separate libraries, cutorch and autograd, which makes Torch7 a little less slick and a little harder to learn. The tutorials and demos provided on the Torch7 official website6 help the readers to better understand this framework. Torch7 is also one of the fastest deep learning platforms available.

8.5 THEANO

Theano is a Python library and an optimizing compiler which efficiently defines, optimizes, and
evaluates mathematical expressions involving multi-dimensional arrays. Theano was primarily
developed by a machine learning group at the University of Montreal. It combines a computer
5 http://www.vlfeat.org/matconvnet/
6 http://torch.ch/docs/getting-started.html


algebra system with an optimizing compiler that is useful for tasks in which complicated mathematical expressions are evaluated repeatedly, and evaluation speed is critical. Theano provides
different implementations for the convolution operation, such as an FFT-based implementation [Mathieu et al., 2014], and an implementation based on the open-source code of image
classification network in Krizhevsky et al. [2012]. Several libraries, such as Pylearn2, Keras,
and Lasagne, have been developed on top of Theano, providing building blocks for fast experimentation with well-known models. Theano uses a symbolic graph for programming a network. Its symbolic API supports looping control, which makes the implementation of RNNs easy and efficient.
Theano has implementations of most state-of-the-art networks, either in the form of a higher-level framework, such as Blocks and Keras, or in pure Theano. However, Theano is somewhat low-level, and large models have long compilation times. For more information, we refer the readers to the official website.7 Note, however, that Theano will no longer be actively developed (i.e., no new features will be implemented) after 2018.

8.6 KERAS

Keras is an open-source high-level neural networks API, written in Python, and capable of
running on top of TensorFlow and Theano. Thus, Keras benefits from the advantages of both
and provides a higher-level and more intuitive set of abstractions, which make it easy to configure
neural networks, regardless of the back-end scientific computing library. The primary motivation
behind Keras is to enable fast experimentation with deep neural networks and to go from idea
to results as quickly as possible. The library consists of numerous implementations of neural
network building blocks and tools to make working with image and text data easier. For instance, Fig. 8.1 compares the Keras code and the TensorFlow code needed to achieve the same purpose. As shown in this figure, a neural network can be built in just a few lines of Keras code. For more examples, please visit the Keras official website.8
Keras offers two types of deep neural networks including sequence-based networks (where
the inputs flow linearly through the network) and graph-based networks (where the inputs
can skip certain layers). Thus, implementing more complex network architectures such as
GoogLeNet and SqueezeNet is easy. However, Keras does not provide most state-of-the-art
pre-trained models.
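In the spirit of Fig. 8.1, the following minimal example builds, compiles, and trains a small fully connected classifier with the standalone Keras API; the layer sizes and the random data are purely illustrative (with tf.keras the imports differ slightly).

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    # A small classifier in a few lines of Keras code.
    model = Sequential([
        Dense(64, activation="relu", input_shape=(100,)),
        Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

    # Train on random, illustrative data (256 samples, 100 features, 10 classes).
    x = np.random.rand(256, 100)
    y = np.eye(10)[np.random.randint(0, 10, size=256)]
    model.fit(x, y, epochs=2, batch_size=32)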

8.7 LASAGNE

Lasagne9 is a lightweight Python library to construct and train networks in Theano. Unlike Keras, Lasagne was developed to be a light wrapper around Theano. Lasagne supports a wide
7 http://deeplearning.net/software/theano/
8 https://keras.io/
9 https://github.com/Lasagne/Lasagne


Figure 8.1: This figure illustrates the difference in the size of the code for (a) TensorFlow and
(b) Keras to achieve the same purpose.


range of deep models, including feed-forward networks such as Convolutional Neural Networks (CNNs), recurrent networks such as LSTMs, and any combination of feed-forward and recurrent networks. It allows architectures with multiple inputs and multiple outputs, including auxiliary classifiers. Defining cost functions is easy, and there is no need to derive gradients thanks to Theano's symbolic differentiation.

8.8 MARVIN

Marvin10 was developed by researchers from Princeton University's Vision Group. It is a GPU-only neural network framework, written in C++, and made with simplicity, hackability, speed, memory consumption, and high-dimensional data in mind. The Marvin implementation consists of two files, namely marvin.hpp and marvin.cu. It supports multi-GPU training, receptive field calculation, and filter visualization. Marvin can easily be installed on different operating systems, including Windows, Linux, Mac, and all other platforms that cuDNN supports.
Marvin does not provide most state-of-the-art pre-trained models. It does, however, provide a script to convert Caffe models into a format that works in Marvin. Moreover, Marvin does not provide good documentation, which makes building new models difficult.

8.9 CHAINER

Chainer is an open-source neural network framework with a Python API. It was developed at
Preferred Networks, a machine-learning startup based in Tokyo. Unlike existing deep learning
frameworks, which use the define-and-run approach, Chainer was designed on the principle of
define-by-run.
In the define-and-run-based frameworks, as shown in Fig. 8.2, models are built in two
phases, namely define and run. In the Define phase, a computational graph is constructed. More
precisely, the Define phase is the instantiation of a neural network object based on a model definition, which specifies the inter-layer connections, initial weights, and activation functions, such
as Protobuf for Caffe. In the Run phase, given a set of training examples, the model is trained
by minimizing the loss function using optimization algorithms, such as SGD. However, the
define-and-run-based frameworks have three major problems, namely (1) inefficient memory
usage, especially for RNN models; (2) limited extensibility; and (3) the inner workings of the
neural network are not accessible to the user, e.g., for the debugging of the models.
Chainer overcomes these drawbacks by providing an easier and more straightforward way to implement more complex deep learning architectures. Unlike other frameworks, Chainer does not fix a model's computational graph before the model is trained. Instead, the computational graph is implicitly memorized when the forward computation on the training data takes place, as shown in Fig. 8.2. Thus, Chainer allows networks to be modified during runtime, and
10 http://marvin.is/


Figure 8.2: Unlike define-and-run-based frameworks, Chainer is a define-by-run framework,
which does not fix a model’s computational graph before the model is trained.
use arbitrary control flow statements. The Chainer official website11 provides several examples and more details about Chainer's core concepts.

8.10 PYTORCH
PyTorch is an open source machine learning library for Python. It was developed by Facebook’s
artificial intelligence research group. Unlike Torch, which is written in Lua (a relatively unpopular programming language), PyTorch leverages the rising popularity of Python. Since its introduction, PyTorch has quickly become a favorite among machine-learning researchers because it allows certain complex architectures to be built easily.
PyTorch is mainly influenced by Chainer. In particular, PyTorch allows networks to be modified during runtime, i.e., it is a define-by-run framework. The tutorials and demos provided on the PyTorch official website12 help the readers to better understand this framework.
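A small illustration of the define-by-run behavior is given below: the computational graph is built while the code executes, so ordinary Python control flow can change the network from one forward pass to the next. The example is a generic sketch, not taken from the PyTorch documentation.

    import torch
    import torch.nn as nn

    layer = nn.Linear(8, 8)

    def forward(x, depth):
        # The graph is constructed as the code runs, so the number of layer
        # applications can depend on arbitrary Python control flow.
        for _ in range(depth):
            x = torch.relu(layer(x))
        return x.sum()

    x = torch.randn(4, 8, requires_grad=True)
    loss = forward(x, depth=3)     # a three-step graph this time
    loss.backward()                # gradients follow whatever graph was actually built
    print(x.grad.shape)            # torch.Size([4, 8])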

11 https://github.com/chainer/chainer
12 http://pytorch.org/tutorials/


Table 8.1: Comparison of deep learning frameworks in terms of creator, platform, core language, and provided interface

Caffe. Creator: Berkeley Vision and Learning Center (BVLC). Platform: Linux, Mac OS X, Windows. Core language: C++. Interface: Python, MATLAB.
TensorFlow. Creator: Google Brain team. Platform: Linux, Mac OS X, Windows. Core language: C++, Python. Interface: Python, C/C++, Java, Go.
MatConvNet. Creator: Oxford Visual Geometry Group (VGG). Platform: Linux, Mac OS X, Windows. Core language: MATLAB, C++. Interface: MATLAB.
Torch7. Creators: Ronan Collobert, Koray Kavukcuoglu, Clement Farabet, Soumith Chintala. Platform: Linux, Mac OS X, Windows, Android, iOS. Core language: C, Lua. Interface: Lua, LuaJIT, C, utility library for C++.
Theano. Creator: University of Montreal. Platform: cross-platform. Core language: Python. Interface: Python.
Keras. Creator: Francois Chollet. Platform: Linux, Mac OS X, Windows. Core language: Python. Interface: Python.
Lasagne. Creators: Sander Dieleman and others. Platform: Linux, Mac OS X, Windows. Core language: Python. Interface: Python.
Marvin. Creators: Jianxiong Xiao, Shuran Song, Daniel Suo, Fisher Yu. Platform: Linux, Mac OS X, Windows. Core language: C++. Interface: C++.
Chainer. Creators: Seiya Tokui, Kenta Oono, Shohei Hido, Justin Clayton. Platform: Linux, Mac OS X. Core language: Python. Interface: Python.
PyTorch. Creators: Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan. Platform: Linux, Mac OS X, Windows. Core language: Python, C, CUDA. Interface: Python.


Table 8.2: Comparison of deep learning frameworks in terms of OpenMP, OpenCL, and CUDA support, number of available pre-trained models, RNNs, CNNs, and RBM/DBNs

Caffe. OpenMP: No. OpenCL: Yes. CUDA: Yes. Pre-trained models: Yes. RNNs: Yes. CNNs: Yes. RBM/DBNs: Yes.
TensorFlow. OpenMP: No. OpenCL: On roadmap. CUDA: Yes. Pre-trained models: Yes. RNNs: Yes. CNNs: Yes. RBM/DBNs: Yes.
MatConvNet. OpenMP: Yes. OpenCL: No. CUDA: Yes. Pre-trained models: A few. RNNs: No. CNNs: Yes. RBM/DBNs: No.
Torch7. OpenMP: Yes. OpenCL: Third-party implementations. CUDA: Yes. Pre-trained models: Yes. RNNs: Yes. CNNs: Yes. RBM/DBNs: Yes.
Theano. OpenMP: Yes. OpenCL: Yes. CUDA: Yes. Pre-trained models: Under development. RNNs: Yes. CNNs: Yes. RBM/DBNs: Yes.
Keras. OpenMP: Yes. OpenCL: Under development for the Theano backend, on roadmap for the TensorFlow backend. CUDA: Yes. Pre-trained models: A few. RNNs: Yes. CNNs: Yes. RBM/DBNs: Yes.
Lasagne. OpenMP: No. OpenCL: If using Theano as backend. CUDA: Yes. Pre-trained models: Yes. RNNs: Yes. CNNs: Yes. RBM/DBNs: Yes.
Marvin. OpenMP: Yes. OpenCL: Yes. CUDA: Yes. Pre-trained models: A few. RNNs: Yes. CNNs: Yes. RBM/DBNs: No.
Chainer. OpenMP: Yes. OpenCL: No. CUDA: Yes. Pre-trained models: A few. RNNs: Yes. CNNs: Yes. RBM/DBNs: No.
PyTorch. OpenMP: Yes. OpenCL: No. CUDA: Yes. Pre-trained models: Yes. RNNs: Yes. CNNs: Yes. RBM/DBNs: Yes.

CHAPTER 9

Conclusion
BOOK SUMMARY
The application of deep learning algorithms, especially CNNs, to computer vision problems has seen rapid progress. This has led to highly robust, efficient, and flexible vision systems.
This book aimed to introduce different aspects of CNNs in computer vision problems. The first
part of this book (Chapter 1 and Chapter 2) introduced computer vision and machine learning subjects, and reviewed the traditional feature representation and classification methods. We
then briefly covered two generic categories of deep neural networks, namely the feed-forward
and the feed-back networks, their respective computational mechanisms and historical background in Chapter 3. Chapter 4 provided a broad survey of the recent advances in CNNs, including state-of-the-art layers, weight initialization techniques, regularization approaches, and
several loss functions. Chapter 5 reviewed popular gradient-based learning algorithms followed
by gradient-based optimization methodologies. Chapter 6 introduced the most popular CNN
architectures which were mainly developed for object detection and classification tasks. A wide
range of CNN applications in computer vision tasks, including image classification, object detection, object tracking, pose estimation, action recognition, and scene labeling have been discussed in Chapter 7. Finally, several widely used deep learning libraries have been presented in
Chapter 8 to help the readers to understand the main features of these frameworks.

FUTURE RESEARCH DIRECTIONS
While Convolutional Neural Networks (CNNs) have achieved great performances in experimental evaluations, there are still several challenges which we believe deserve further investigation.
First, deep neural networks require massive amounts of data and computing power to be
trained. However, the manual collection of large-scale labeled datasets is a daunting task. The
requirement of large amounts of labeled data can be relaxed by extracting hierarchical features
through unsupervised learning techniques. Meanwhile, to speed up the learning process, the development of effective and scalable parallel learning algorithms warrants further investigation. In
several application domains, there exists a long-tail distribution of object classes, where not enough training examples are available for the less frequent classes. In such scenarios, deep networks
need to be adapted appropriately to overcome the class-imbalance problem [Khan et al., 2017a].


Second, the memory required to store the huge number of parameters is another challenge in deep neural networks, including CNNs. At test time, these deep models have a high memory footprint, which makes them unsuitable for mobile and other hand-held devices that have limited resources. Thus, it is important to investigate how to decrease the complexity of deep neural networks without a loss of accuracy.
Third, selecting suitable model hyper-parameters (e.g., learning rate, number of layers,
kernel size, number of feature maps, stride, pooling size, and pooling regions) requires considerable skill and experience. These hyper-parameters have a significant impact on the accuracy
of deep models. Without an automated hyper-parameter tuning method, one needs to make
manual adjustments to the hyper-parameters using many training runs to achieve the optimal
values. Some recent works [He et al., 2016a, Huang et al., 2016b] have tried optimization techniques for hyper-parameter selection and have shown that there is room to improve current
optimization techniques for learning deep CNN architectures.
Fourth, although deep CNNs have demonstrated an impressive performance on a variety of applications, they still lack solid mathematical and theoretical foundations. There is little
insight into the behavior of these complex neural networks, or how they achieve such good
performance. As a result, it might be difficult to improve their performance, and thus, the development of better models is restricted to trial-and-error. Therefore, attempting to see what
features have been learned, and to understand what computations are performed at each layer
in deep CNNs is an increasingly popular direction of research [Mahendran and Vedaldi, 2015,
2016, Zeiler and Fergus, 2014].
Fifth, several machine learning algorithms, including state-of-the-art CNNs, are vulnerable to adversarial examples. Adversarial examples are intentionally designed to cause the learned model to make a mistake. For example, by adding a small perturbation to an image
of a “panda” in a way that is undetectable to the human eye, the resultant image is recognized
as a “gibbon” with high confidence (Fig. 9.1). Thus, sophisticated attackers could trick neural networks. For example, one could target autonomous vehicles by using stickers to design an
adversarial stop sign that the vehicle would interpret as a yield sign [Papernot et al., 2017]. Therefore, coming up with sophisticated defense strategies is a vital part of many machine learning
algorithms, including CNNs.
Finally, a major shortcoming of convolutional networks is their inability to work on arbitrarily shaped inputs, e.g., cyclic and acyclic graphs. Furthermore, there does not exist a principled way to incorporate structured losses in the deep network formulation. Such losses are essential for structured prediction tasks such as body pose estimation and semantic segmentation.
There have recently been some efforts toward the extension of CNNs to arbitrarily shaped graph
structures [Defferrard et al., 2016, Kipf and Welling, 2016]. However, this area has numerous
promising applications and needs further technical breakthroughs to enable fast and scalable
architectures that are suitable for graphs.

[Figure 9.1 panels: the original image is classified as a panda with 60% confidence; a small adversarial perturbation is added; the modified image is classified as a gibbon with 99% confidence.]

Figure 9.1: Adversarial example: by mathematically manipulating a “panda” image in a way that
is undetectable to the human eye (e.g., adding a small perturbation), the learned neural networks
can be tricked into misclassifying objects by sophisticated attackers.
In conclusion, we hope that this book not only provides a better understanding of CNNs
for computer vision tasks, but also facilitates future research activities and application developments in the field of computer vision and CNNs.

Bibliography
ILSVRC 2016 Results. http://image-net.org/challenges/LSVRC/2016/results 106
A. Alahi, R. Ortiz, and P. Vandergheynst. Freak: Fast retina keypoint. In IEEE Conference on Computer Vision and Pattern Recognition, pages 510–517, 2012. DOI: 10.1109/cvpr.2012.6247715. 14
Ark Anderson, Kyle Shaffer, Artem Yankov, Court D. Corley, and Nathan O. Hodas. Beyond fine tuning: A modular approach to learning on small data. arXiv preprint arXiv:1611.01714, 2016. 72
Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5297–5307, 2016. DOI: 10.1109/cvpr.2016.572. 63, 64
Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T. Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 328–335, 2014. DOI: 10.1109/cvpr.2014.49. 140
Hossein Azizpour, Ali Sharif Razavian, Josephine Sullivan, Atsuto Maki, and Stefan Carlsson. Factors of transferability for a generic convnet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1790–1802, 2016. DOI: 10.1109/tpami.2015.2500224. 72
David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? arXiv preprint arXiv:1702.08591, 2017. 95, 98
Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346–359, 2008. DOI: 10.1016/j.cviu.2007.09.014. 7, 11, 14, 19, 29
Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: A survey. arXiv preprint arXiv:1502.05767, 2015. 90
N. Bayramoglu and A. Alatan. Shape index SIFT: Range image recognition using local features. In 20th International Conference on Pattern Recognition, pages 352–355, 2010. DOI: 10.1109/icpr.2010.95. 13
Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 19:153, 2007. 70
Zhou Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs. In International Conference on Learning Representations, 2015. 94, 97
Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. DOI: 10.1023/A:1010933404324. 7, 11, 22, 26, 29
M. Brown and D. Lowe. Invariant features from interest point groups. In Proc. of the British Machine Vision Conference, pages 23.1–23.10, 2002. DOI: 10.5244/c.16.23. 20
M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary robust independent elementary features. In 11th European Conference on Computer Vision, pages 778–792, 2010. DOI: 10.1007/978-3-642-15561-1_56. 14
Jean-Pierre Changeux and Paul Ricoeur. What Makes Us Think?: A Neuroscientist and a Philosopher Argue About Ethics, Human Nature, and the Brain. Princeton University Press, 2002. 40
Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014. DOI: 10.5244/c.28.6. 123
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014. 50, 133
Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014. DOI: 10.3115/v1/w14-4012. 38
Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society Conference on, volume 1, pages 539–546, 2005. DOI: 10.1109/cvpr.2005.202. 67
Dan C. Cireşan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183, 2011. 102
Corinna Cortes. Support-vector networks. Machine Learning, 20(3):273–297, 1995. DOI: 10.1007/bf00994018. 7, 11, 22, 29
Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001. 66
Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016. 170
Li Deng. A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Transactions on Signal and Information Processing, 3:e2, 2014. DOI: 10.1017/atsip.2014.4. 117
Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2625–2634, 2015. DOI: 10.1109/cvpr.2015.7298878. 150, 155
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011. 83
Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016. 60
Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010. DOI: 10.1109/tpami.2009.167. 121
R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(7):179–188, 1936. DOI: 10.1111/j.1469-1809.1936.tb02137.x. 22
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997. DOI: 10.1006/jcss.1997.1504. 22
Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2000. 22
Kunihiko Fukushima and Sei Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pages 267–285, Springer, 1982. DOI: 10.1007/978-3-642-46466-9_18. 43
Ross Girshick. Fast R-CNN. In Proc. of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015. DOI: 10.1109/iccv.2015.169. 60
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142–158, 2016. DOI: 10.1109/tpami.2015.2437384. 120
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 9:249–256, 2010. 70
Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016. 142
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014. 141, 142, 149
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org 54
Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005. DOI: 10.1016/j.neunet.2005.06.042. 38
Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014. 38
Alex Graves et al. Supervised Sequence Labelling with Recurrent Neural Networks, volume 385, Springer, 2012. DOI: 10.1007/978-3-642-24797-2. 31
Saurabh Gupta, Pablo Arbelaez, and Jitendra Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 564–571, 2013. DOI: 10.1109/cvpr.2013.79. 140
Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision, pages 345–360, Springer, 2014. DOI: 10.1007/978-3-319-10584-0_23. 139, 141
Richard H. R. Hahnloser, Rahul Sarpeshkar, Misha A. Mahowald, Rodney J. Douglas, and H. Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947, 2000. DOI: 10.1038/35016072. 55
Munawar Hayat, Salman H. Khan, Mohammed Bennamoun, and Senjian An. A spatial layout and scale invariant feature representation for indoor scene classification. IEEE Transactions on Image Processing, 25(10):4829–4841, 2016. DOI: 10.1109/tip.2016.2599292. 96
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision,
pages 346–361. Springer, 2014. DOI: 10.1007/978-3-319-10578-9_23. 62

BIBLIOGRAPHY

177

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proc. of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015a. DOI: 10.1109/iccv.2015.123.
71
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep
convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015b. DOI: 10.1109/tpami.2015.2389824. 61, 125
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 770–778, 2016a. DOI: 10.1109/cvpr.2016.90. 77, 106, 170
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual
networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016b. DOI:
10.1007/978-3-319-46493-0_38. 108, 111
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. arXiv preprint
arXiv:1703.06870, 2017. DOI: 10.1109/iccv.2017.322. 60, 140, 141
Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh.
A fast learning algorithm for deep belief nets.
Neural Computation, 18(7):1527–1554, 2006. DOI:
10.1162/neco.2006.18.7.1527. 70
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation,
9(8):1735–1780, 1997. DOI: 10.1162/neco.1997.9.8.1735. 38
Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely
connected convolutional networks.
arXiv preprint arXiv:1608.06993, 2016a. DOI:
10.1109/cvpr.2017.243. 114, 115
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks
with stochastic depth. In European Conference on Computer Vision, pages 646–661, 2016b.
DOI: 10.1007/978-3-319-46493-0_39. 170
David H. Hubel and Torsten N. Wiesel. Receptive fields of single neurones in the cat’s
striate cortex. The Journal of Physiology, 148(3):574–591, 1959. DOI: 10.1113/jphysiol.1959.sp006308. 43
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 76, 77
Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In
Advances in Neural Information Processing Systems, pages 2017–2025, 2015. 63

178

BIBLIOGRAPHY

Anil K. Jain, Jianchang Mao, and K. Moidin Mohiuddin. Artificial neural networks: A tutorial.
Computer, 29(3):31–44, 1996. DOI: 10.1109/2.485891. 39
Katarzyna Janocha and Wojciech Marian Czarnecki.
On loss functions for deep
neural networks in classification.
arXiv preprint arXiv:1702.05659, 2017. DOI:
10.4467/20838476si.16.004.6185. 68
Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition (CVPR),
IEEE Conference on, pages 3304–3311, 2010. DOI: 10.1109/cvpr.2010.5540039. 63
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and
Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proc. of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014. DOI:
10.1109/cvpr.2014.223. 150, 152
Salman H. Khan. Feature learning and structured prediction for scene understanding. Ph.D.
Thesis, University of Western Australia, 2016. 135
Salman H. Khan, Mohammed Bennamoun, Ferdous Sohel, and Roberto Togneri. Automatic
shadow detection and removal from a single image. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 38(3):431–446, 2016a. DOI: 10.1109/tpami.2015.2462355. 141
Salman H. Khan, Munawar Hayat, Mohammed Bennamoun, Roberto Togneri, and Ferdous A. Sohel. A discriminative representation of convolutional features for indoor
scene recognition. IEEE Transactions on Image Processing, 25(7):3372–3383, 2016b. DOI:
10.1109/tip.2016.2567076. 72, 95
Salman H. Khan, Munawar Hayat, Mohammed Bennamoun, Ferdous A. Sohel, and
Roberto Togneri. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 2017a. DOI:
10.1109/tnnls.2017.2732482. 169
Salman H. Khan, Munawar Hayat, and Fatih Porikli. Scene categorization with spectral features. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5638–
5648, 2017b. DOI: 10.1109/iccv.2017.601. 94
Salman H. Khan, Xuming He, Fatih Porikli, Mohammed Bennamoun, Ferdous Sohel, and
Roberto Togneri. Learning deep structured network for weakly supervised change detection.
In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1–7, 2017c.
DOI: 10.24963/ijcai.2017/279. 141
Salman Hameed Khan, Mohammed Bennamoun, Ferdous Sohel, and Roberto Togneri. Automatic feature learning for robust shadow detection. In Computer Vision and Pattern Recognition
(CVPR), IEEE Conference on, pages 1939–1946, 2014. DOI: 10.1109/cvpr.2014.249. 93

BIBLIOGRAPHY

179

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014. 85, 86
Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907, 2016. 170
Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFS with
Gaussian edge potentials. In Advances in Neural Information Processing Systems 24, pages 109–
117, 2011. 132, 135
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with
deep convolutional neural networks. In Advances in Neural Information Processing Systems,
pages 1097–1105, 2012. DOI: 10.1145/3065386. 45, 74, 102, 117, 123, 140, 150, 162
Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural
networks without residuals. arXiv preprint arXiv:1605.07648, 2016. 112, 113
Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition,
IEEE Computer Society Conference on, 2:2169–2178, 2006. DOI: 10.1109/cvpr.2006.68. 61
Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne
Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989. DOI: 10.1162/neco.1989.1.4.541. 43
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11):2278–2324, 1998. DOI:
10.1109/5.726791. 101, 102
Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photorealistic single image super-resolution using a generative adversarial network. arXiv preprint
arXiv:1609.04802, 2016. DOI: 10.1109/cvpr.2017.19. 141, 145, 147, 148
Stefan Leutenegger, Margarita Chli, and Roland Y. Siegwart. BRISK: Binary robust invariant
scalable keypoints. In Proc. of the International Conference on Computer Vision, pages 2548–
2555, 2011. DOI: 10.1109/iccv.2011.6126542. 14
Li-Jia Li, Richard Socher, and Li Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In Computer Vision and Pattern Recognition, (CVPR). IEEE Conference on, pages 2036–2043, 2009. DOI:
10.1109/cvpr.2009.5206718. 135
Min Lin, Qiang Chen, and Shuicheng Yan.
arXiv:1312.4400, 2013. 56, 103

Network in network.

arXiv preprint

180

BIBLIOGRAPHY

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic
segmentation. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 3431–3440, 2015. DOI: 10.1109/cvpr.2015.7298965. 127, 128, 130
David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal
on Computer Vision, 60(2):91–110, 2004. DOI: 10.1023/b:visi.0000029664.99615.94. 7, 11,
14, 16, 17, 19, 29
A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them.
In Computer Vision and Pattern Recognition, (CVPR). IEEE Computer Society Conference on,
pages 5188–5196, 2015. DOI: 10.1109/cvpr.2015.7299155. 97, 99, 170
A. Mahendran and A. Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. International Journal on Computer Vision, 120(3):233–255, 2016. DOI:
10.1007/s11263-016-0911-8. 170
Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks
through FFTs. In International Conference on Learning Representations (ICLR2014), 2014.
162
Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in
nervous activity.
The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943. DOI:
10.1007/bf02478259. 40
Dmytro Mishkin and Jiri Matas. All you need is a good INIT. arXiv preprint arXiv:1511.06422,
2015. 71
B. Triggs and N. Dalal. Histograms of oriented gradients for human detection. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, pages 1063–6919,
2005. DOI: 10.1109/CVPR.2005.177. 7, 11, 14, 15, 29
Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of
convergence o (1/k2). In Doklady an SSSR, 269:543–547, 1983. 82
Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for
semantic segmentation. In Proc. of the IEEE International Conference on Computer Vision,
pages 1520–1528, 2015. DOI: 10.1109/iccv.2015.178. 130
Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proc. of the ACM on
Asia Conference on Computer and Communications Security, (ASIA CCS’17), pages 506–519,
2017. DOI: 10.1145/3052973.3053009. 170
Razvan Pascanu, Yann N. Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddle point
problem for non-convex optimization. arXiv preprint arXiv:1405.4604, 2014. 81

BIBLIOGRAPHY

181

Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point
sets for 3D classification and segmentation. arXiv preprint arXiv:1612.00593, 2016. DOI:
10.1109/cvpr.2017.16. 117, 118, 119
J. R. Quinlan. Induction of decision trees. Machine Learning, pages 81–106, 1986. DOI:
10.1007/bf00116251. 7, 11, 22, 26, 29
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with
deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
141, 145, 149
H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian. HOPC: Histogram of oriented
principal components of 3D pointclouds for action recognition. In 13th European Conference
on Computer Vision, pages 742–757, 2014. DOI: 10.1007/978-3-319-10605-2_48. 13
Hossein Rahmani and Mohammed Bennamoun. Learning action recognition model from
depth and skeleton videos. In The IEEE International Conference on Computer Vision (ICCV),
2017. DOI: 10.1109/iccv.2017.621. 150
Hossein Rahmani and Ajmal Mian. 3D action recognition from novel viewpoints. In Computer
Vision and Pattern Recognition, (CVPR). IEEE Computer Society Conference on, pages 1506–
1515, 2016. DOI: 10.1109/cvpr.2016.167. 74, 150
Hossein Rahmani, Ajmal Mian, and Mubarak Shah. Learning a deep model for human action recognition from novel viewpoints. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2017. DOI: 10.1109/tpami.2017.2691768. 74, 150
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time
object detection with region proposal networks. In Advances in Neural Information Processing
Systems, pages 91–99, 2015. DOI: 10.1109/tpami.2016.2577031. 61, 123
Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In Proc. of the International Conference on Computer Vision, pages 2564–
2571, 2011. DOI: 10.1109/iccv.2011.6126544. 14
Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint
arXiv:1609.04747, 2016. 81
David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. Technical report, DTIC Document, 1985. DOI: 10.1016/b9781-4832-1446-7.50035-2. 34
Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear
dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
70

182

BIBLIOGRAPHY

Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for
face recognition and clustering. In Proc. of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 815–823, 2015. DOI: 10.1109/cvpr.2015.7298682. 68
Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014. DOI:
10.1109/cvprw.2014.131. 72
J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and
A. Blake. Real-time human pose recognition in parts from single depth images. In Computer
Vision and Pattern Recognition, (CVPR). IEEE Computer Society Conference on, pages 1297–
1304, 2011. DOI: 10.1145/2398356.2398381. 26
Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb.
Learning from simulated and unsupervised images through adversarial training. arXiv
preprint arXiv:1612.07828, 2016. DOI: 10.1109/cvpr.2017.241. 74
Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Proc. of the 27th International Conference on Neural Information Processing
Systems—Volume 1, (NIPS’14), pages 568–576, 2014a. 150, 152, 153
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014b. 50, 70, 104, 123
Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. Construction of a
large-scale image dataset using deep learning with humans in the loop. arXiv preprint
arXiv:1506.03365, 2015. 146, 147
Shuran Song and Jianxiong Xiao. Deep sliding shapes for a modal 3D object detection in
RGB-D images. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 808–816, 2016. DOI: 10.1109/cvpr.2016.94. 136
Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. Sun RGB-D: A RGB-D scene understanding benchmark suite. In Proc. of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 567–576, 2015. DOI: 10.1109/cvpr.2015.7298655. 136
Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015. 146
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine
Learning Research, 15(1):1929–1958, 2014. 75, 79, 102

BIBLIOGRAPHY

183

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv
preprint arXiv:1505.00387, 2015. 108
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9,
2015. DOI: 10.1109/cvpr.2015.7298594. 105, 106, 107
Yichuan Tang. Deep learning using linear support vector machines.
arXiv:1306.0239, 2013. 67

arXiv preprint

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running
average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2),
2012. 85
Jasper R. R. Uijlings, Koen E. A. Van De Sande, Theo Gevers, and Arnold W. M. Smeulders.
Selective search for object recognition. International Journal of Computer Vision, 104(2):154–
171, 2013. DOI: 10.1007/s11263-013-0620-5. 61, 120, 122
Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al.
Conditional image generation with pixel CNN decoders. In Advances in Neural Information
Processing Systems, pages 4790–4798, 2016. 141
Li Wan, Matthew Zeiler, Sixin Zhang, Yann L. Cun, and Rob Fergus. Regularization of neural
networks using dropconnect. In Proc. of the 30th International Conference on Machine Learning
(ICML’13), pages 1058–1066, 2013. 75
Heng Wang, A. Klaser, C. Schmid, and Cheng-Lin Liu. Action recognition by dense trajectories. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, (CVPR’11),
pages 3169–3176, 2011a. DOI: 10.1109/cvpr.2011.5995407. 154
Zhenhua Wang, Bin Fan, and Fuchao Wu. Local intensity order pattern for feature description.
In Proc. of the International Conference on Computer Vision, pages 1550–5499, 2011b. DOI:
10.1109/iccv.2011.6126294. 14
Jason Weston, Chris Watkins, et al. Support vector machines for multi-class pattern recognition.
In ESANN, 99:219–224, 1999. 67
Bernard Widrow, Marcian E. Hoff, et al. Adaptive switching circuits. In IRE WESCON Convention Record, 4:96–104, New York, 1960. DOI: 10.21236/ad0241531. 33
Jason Yosinski, Jeff Clune, Thomas Fuchs, and Hod Lipson. Understanding neural networks
through deep visualization. In In ICML Workshop on Deep Learning, Citeseer, 2015. 95, 98

184

BIBLIOGRAPHY

Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv
preprint arXiv:1511.07122, 2015. 50, 141
Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional
neural networks. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 4353–4361, 2015. DOI: 10.1109/cvpr.2015.7299064. 141
Matthew D. Zeiler.
Adadelta: An adaptive learning rate method.
arXiv:1212.5701, 2012. 84

arXiv preprint

Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In
European Conference on Computer Vision, pages 818–833, Springer, 2014. DOI: 10.1007/9783-319-10590-1_53. 44, 94, 95, 97, 170
Yinda Zhang, Mingru Bai, Pushmeet Kohli, Shahram Izadi, and Jianxiong Xiao. Deepcontext: Context-encoding neural pathways for 3D holistic scene understanding. arXiv preprint
arXiv:1603.04922, 2016. DOI: 10.1109/iccv.2017.135. 135, 139
Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. Loss functions for neural networks for
image processing. arXiv preprint arXiv:1511.08861, 2015. 67, 68
Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene
parsing network. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 1–7, 2017. DOI: 10.1109/cvpr.2017.660. 141
Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su,
Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent
neural networks. In Proc. of the IEEE International Conference on Computer Vision, pages 1529–
1537, 2015. DOI: 10.1109/iccv.2015.179. 141
C. Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In
European Conference on Computer Vision, pages 391–405, Springer, 2014. DOI: 10.1007/9783-319-10602-1_26. 61, 132

185

Authors’ Biographies
SALMAN KHAN
Salman Khan received a B.E. in Electrical Engineering from the National University of Sciences and Technology (NUST) in 2012 with high distinction, and a Ph.D. from The University of Western Australia (UWA) in 2016. His Ph.D. thesis received an Honorable Mention for the Dean's List Award. In 2015, he was a visiting researcher with National ICT Australia, Canberra Research Laboratories. He is currently a Research Scientist with Data61, Commonwealth Scientific and Industrial Research Organisation (CSIRO), and has been an Adjunct Lecturer with the Australian National University (ANU) since 2016. He has been awarded several prestigious scholarships, such as the International Postgraduate Research Scholarship (IPRS) for his Ph.D. and the Fulbright Scholarship for his M.S. He has served as a program committee member for several leading computer vision and robotics conferences, such as IEEE CVPR, ICCV, ICRA, WACV, and ACCV. His research interests include computer vision, pattern recognition, and machine learning.

HOSSEIN RAHMANI
Hossein Rahmani received his B.Sc. in Computer Software Engineering in 2004 from Isfahan University of Technology, Isfahan, Iran, and his M.Sc. in Software Engineering in 2010 from Shahid Beheshti University, Tehran, Iran. He completed his Ph.D. at The University of Western Australia in 2016. He has published several papers in top conferences and journals such as CVPR, ICCV, ECCV, and TPAMI. He is currently a Research Fellow in the School of Computer Science and Software Engineering at The University of Western Australia. He has served as a reviewer for several leading computer vision conferences and journals, such as IEEE TPAMI and CVPR. His research interests include computer vision, action recognition, 3D shape analysis, and machine learning.

SYED AFAQ ALI SHAH
Syed Afaq Ali Shah received his B.Sc. and M.Sc. degrees in Electrical Engineering from the University of Engineering and Technology (UET) Peshawar, in 2003 and 2010, respectively. He obtained his Ph.D. in the area of computer vision and machine learning from The University of Western Australia in 2016. He is currently working as a Research Associate in the School of Computer Science and Software Engineering at The University of Western Australia, Crawley, Australia. He has been awarded the "Start Something Prize for Research Impact through Enterprise" for a 3D facial analysis project funded by the Australian Research Council. He has served as a program committee member for ACIVS 2017. His research interests include deep learning, computer vision, and pattern recognition.

MOHAMMED BENNAMOUN
Mohammed Bennamoun received his M.Sc. from Queen's University, Kingston, Canada, in the area of Control Theory, and his Ph.D. from Queen's/QUT in Brisbane, Australia, in the area of Computer Vision. He lectured in Robotics at Queen's and then joined QUT in 1993 as an Associate Lecturer. He is currently a Winthrop Professor. He served as the Head of the School of Computer Science and Software Engineering at The University of Western Australia (UWA) for five years (February 2007–March 2012), and as the Director of a University Centre at QUT, The Space Centre for Satellite Navigation, from 1998 to 2002.
He served as a member of the Australian Research Council (ARC) College of Experts from 2013 to 2015. He was an Erasmus Mundus Scholar and Visiting Professor at the University of Edinburgh in 2006. He was also a visiting professor at CNRS (Centre National de la Recherche Scientifique) and Telecom Lille1, France, in 2009, at the Helsinki University of Technology in 2006, and at the University of Bourgogne and Paris 13, France, in 2002–2003. He is the co-author of the book Object Recognition: Fundamentals and Case Studies (Springer-Verlag, 2001) and of the edited book Ontology Learning and Knowledge Discovery Using the Web, published in 2011.
Mohammed has published over 100 journal papers and over 250 conference papers, and has secured highly competitive national grants from the ARC, government, and other funding bodies. Some of these grants were in collaboration with industry partners (through the ARC Linkage Project scheme) to solve real research problems for industry, including Swimming Australia, the West Australian Institute of Sport, a textile company (Beaulieu Pacific), and AAMGeoScan. He has worked on research problems and collaborated (through joint publications, grants, and the supervision of Ph.D. students) with researchers from different disciplines, including animal biology, speech processing, biomechanics, ophthalmology, dentistry, linguistics, robotics, photogrammetry, and radiology. He has collaborated with researchers from within Australia (e.g., CSIRO) as well as internationally (e.g., Germany, France, Finland, and the U.S.). He has won several awards, including the Best Supervisor of the Year Award at QUT in 1998, an award for teaching excellence (research supervision), and the Vice-Chancellor's Award for Research Mentorship in 2016. He also received an award for research supervision at UWA in 2008.
He has served as a guest editor for a couple of special issues of international journals, such as the International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI). He was selected to give conference tutorials at the European Conference on Computer Vision (ECCV), the International Conference on Acoustics, Speech and Signal Processing (IEEE ICASSP), the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), and Interspeech (2014), as well as a course at the International Summer School on Deep Learning (DeepLearn2017). He has organized several special sessions for conferences, including a special session for the IEEE International Conference on Image Processing (IEEE ICIP). He has served on the program committees of many conferences, e.g., 3D Digital Imaging and Modeling (3DIM) and the International Conference on Computer Vision, and has also contributed to the organization of many local and international conferences. His areas of interest include control theory, robotics, obstacle avoidance, object recognition, machine/deep learning, signal/image processing, and computer vision (particularly 3D).


