Deep Learning with PyTorch Quick Start Guide
Deep Learning with PyTorch Quick Start Guide
Learn to train and deploy neural network models in Python

David Julian

BIRMINGHAM - MUMBAI

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Noyonika Das
Content Development Editor: Kirk Dsouza
Technical Editor: Sushmeeta Jena
Copy Editor: Safis Editing
Project Coordinator: Hardik Bhinde
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Alishon Mendonsa
Production Coordinator: Nilesh Mohite

First published: December 2018
Production reference: 1201218

Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.

ISBN 978-1-78953-409-2

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at customercare@packtpub.com for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

David Julian is a freelance technology consultant and educator. He has worked as a consultant for government, private, and community organizations on a variety of projects, including using machine learning to detect insect outbreaks in controlled agricultural environments (Urban Ecological Systems Ltd., Bluesmart Farms), designing and implementing event management data systems (Sustainable Industry Expo, Lismore City Council), and designing multimedia interactive installations (Adelaide University). He has also written Designing Machine Learning Systems With Python for Packt Publishing and was technical reviewer for Python Machine Learning and Hands-On Data Structures and Algorithms with Python - Second Edition, published by Packt.

About the reviewer

AshishSingh Bhatia has more than 10 years' IT experience in different domains, including ERP, banking, education, and resource management. He is a learner, reader, and developer at heart. He is passionate about Python, Java, and R. He loves to explore new technologies. He has also published two books: Machine Learning with Java and R and Natural Language Processing with Java.
Apart from this, he has also recorded a video tutorial on PyTorch.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Preface
Chapter 1: Introduction to PyTorch
    What is PyTorch?
    Installing PyTorch
        Digital Ocean
            Tunneling in to IPython
        Amazon Web Services (AWS)
    Basic PyTorch operations
        Default value initialization
        Converting between tensors and NumPy arrays
        Slicing and indexing and reshaping
        In place operations
    Loading data
        PyTorch dataset loaders
        Displaying an image
        DataLoader
        Creating a custom dataset
        Transforms
        ImageFolder
        Concatenating datasets
    Summary
Chapter 2: Deep Learning Fundamentals
    Approaches to machine learning
    Learning tasks
        Unsupervised learning
            Clustering
            Principal component analysis
        Reinforcement learning
        Supervised learning
            Classification
            Evaluating classifiers
    Features
        Handling text and categories
    Models
        Linear algebra review
        Linear models
            Gradient descent
            Multiple features
            The normal equation
            Logistic regression
        Nonlinear models
            Artificial neural networks
            The perceptron
    Summary
Chapter 3: Computational Graphs and Linear Models
    autograd
    Computational graphs
    Linear models
        Linear regression in PyTorch
        Saving models
        Logistic regression
        Activation functions in PyTorch
    Multi-class classification example
    Summary
Chapter 4: Convolutional Networks
    Hyper-parameters and multilayered networks
    Benchmarking models
    Convolutional networks
        A single convolutional layer
        Multiple kernels
        Multiple convolutional layers
        Pooling layers
        Building a single-layer CNN
        Building a multiple-layer CNN
        Batch normalization
    Summary
Chapter 5: Other NN Architectures
    Introduction to recurrent networks
        Recurrent artificial neurons
        Implementing a recurrent network
    Long short-term memory networks
        Implementing an LSTM
    Building a language model with a gated recurrent unit
    Summary
Chapter 6: Getting the Most out of PyTorch
    Multiprocessor and distributed environments
        Using a GPU
        Distributed environments
        torch.distributed
        torch.multiprocessing
    Optimization techniques
        Optimizer algorithms
        Learning rate scheduler
        Parameter groups
    Pretrained models
        Implementing a pretrained model
    Summary
Other Books You May Enjoy
Index

Preface

PyTorch is surprisingly easy to learn and provides advanced features such as multiprocessor, distributed, and parallel computation support. PyTorch has a library of pre-trained models, providing out-of-the-box solutions for image classification. PyTorch offers one of the most accessible entry points into cutting-edge deep learning. It is tightly integrated with the Python programming language, so for Python programmers, coding in it seems natural and intuitive. Its unique, dynamic way of treating computational graphs means that PyTorch is both efficient and flexible.

Who this book is for

This book is for anyone who wants a straightforward, practical introduction to deep learning using PyTorch. The aim is to give you an understanding of deep learning models by direct experimentation. This book is perfect for those who are familiar with Python, know some machine learning basics, and are looking for a way to productively develop their skills. The book will focus on the most important features and give practical examples.
It assumes you have a working knowledge of Python and are familiar with the relevant mathematical ideas, including linear algebra and differential calculus. The book provides enough theory to get you up and running without requiring a rigorous mathematical understanding. By the end of the book, you will have a practical knowledge of deep learning systems and be able to apply PyTorch models to solve the problems that you care about.

What this book covers

Chapter 1, Introduction to PyTorch, gets you up and running with PyTorch, demonstrates its installation on a variety of platforms, and explores key syntax elements and how to import and use data in PyTorch.

Chapter 2, Deep Learning Fundamentals, is a whirlwind tour of the basics of deep learning, covering the mathematics and theory of optimization, linear networks, and neural networks.

Chapter 3, Computational Graphs and Linear Models, demonstrates how to calculate the error gradient of a linear network and how to harness it to classify images.

Chapter 4, Convolutional Networks, examines the theory of convolutional networks and how to use them for image classification.

Chapter 5, Other NN Architectures, discusses the theory behind recurrent networks and shows how to use them to make predictions about sequence data. It also discusses long short-term memory networks (LSTMs) and has you build a language model to predict text.

Chapter 6, Getting the Most out of PyTorch, examines some advanced features, such as using PyTorch in multiprocessor and parallel environments. You will build a flexible solution for image classification using out-of-the-box pre-trained models.

To get the most out of this book

This book does not assume any specialist knowledge, only solid general computer skills. Python is a relatively easy (and incredibly useful!) language to learn, so don't worry if you have limited or no programming background.
The book does contain some relatively simple mathematics and some theory that some readers may find difficult at first. Deep learning models are complex systems, and understanding the behavior of even simple neural networks is a non-trivial exercise. Fortunately, PyTorch acts as a high-level framework around these complicated systems, so it is possible to achieve very good results without an expert understanding of the theoretical foundations.

Installing the software is easy, and essentially only two packages are required: the Anaconda distribution of Python, and PyTorch itself. The software runs on Windows 7 and 10, macOS 10.10 or above, and most versions of Linux. It can be run on a desktop machine or in a server environment. All the code in this book was tested using PyTorch version 1.0 and Python 3, running on Ubuntu 16.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Deep-Learning-with-PyTorch-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789534092_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

import numpy as np
x = np.array([[1,2,3],[4,5,6],[1,2,5]])
y = np.linalg.inv(x)
print(y)
print(np.dot(x,y))

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

import numpy as np
x = np.array([[1,2,3],[4,5,6],[1,2,5]])
y = np.linalg.inv(x)
print(y)
print(np.dot(x,y))

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

Warnings or important notes appear like this.

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at customercare@packtpub.com.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

1 Introduction to PyTorch

This is a step-by-step introduction to deep learning using the PyTorch framework. PyTorch is a great entry point into deep learning, and if you have some knowledge of Python, then you will find PyTorch an intuitive, productive, and enlightening experience. The ability to rapidly prototype experiments and test ideas is a core strength of PyTorch. Together with the possibility of turning experiments into productive, deployable resources, the learning curve challenge is abundantly rewarded.

PyTorch is a relatively easy and fun way to understand deep learning concepts. You may be surprised at how few lines of code it takes to solve common problems of classification, such as handwriting recognition and image classification. That said, the fact that PyTorch is easy cannot override the fact that deep learning is, in many ways, hard. It involves some complicated math and some intractable logical conundrums. This should not, however, distract from the fun and useful part of this enterprise.
There is no doubt that machine learning can provide deep insights and solve important problems in the world around us, but getting there can take some work. This book is an attempt, not to gloss over important ideas, but to explain them in a way that is jargon free and succinct. If the idea of solving complicated differential equations makes you break out in a cold sweat, you are not alone. This might be related to some high school trauma of a bad-tempered math teacher furiously demanding you cite Euler's formula or the trigonometric identities. This is a problem because math itself should be fun, and insight arises not from the laborious memorizing of formulas but through understanding relationships and foundational concepts.

Another thing that can make deep learning appear difficult is that it has a diverse and dynamic frontier of research. This may be confusing for the novice because it does not present an obvious entry point. If you understand some principles and want to test your ideas, it can be a bewildering task to find a suitable set of tools. The combinations of development language, framework, deployment architecture, and so on, present a nontrivial decision process.

The science of machine learning has matured to the point that a set of general purpose algorithms for solving problems such as classification and regression has emerged. Subsequently, several frameworks have been created to harness the power of these algorithms and use them for general problem solving. This means that the entry point is at such a level that these technologies are now in the hands of non-computer science professionals. Experts in a diverse array of domains can now use these ideas to advance their endeavors. By the end of this book, and with a little dedication, you will be able to build and deploy useful deep learning models to help solve the problems you are interested in.

In this chapter, we will discuss the following topics:

What is PyTorch?
Installing PyTorch
Basic operations
Loading data

What is PyTorch?

PyTorch is a dynamic, tensor-based deep learning framework for experimentation, research, and production. It can be used as a GPU-enabled replacement for NumPy or as a flexible, efficient platform for building neural networks. The dynamic graph creation and tight Python integration make PyTorch a standout among deep learning frameworks.

If you are at all familiar with the deep learning ecosystem, then frameworks such as Theano and TensorFlow, or higher-level derivatives such as Keras, are amongst the most popular. PyTorch is a relative newcomer to the deep learning framework set. Despite this, it is now being used extensively by Google, Twitter, and Facebook. It stands out from other frameworks in that both Theano and TensorFlow encode computational graphs in static structures that need to be run in self-contained sessions. In contrast, PyTorch can dynamically implement computational graphs. The consequence for a neural net is that the network can change behavior as it is being run, with little or no overhead. In TensorFlow and Theano, to change behavior, you effectively have to rebuild the network from scratch.

This dynamic implementation comes about through a process called tape-based automatic differentiation (autodiff), allowing PyTorch expressions to be automatically differentiated. This has numerous advantages. Gradients can be calculated on the fly, and since the computational graph is dynamic, it can be changed at each function call, allowing it to be used in interesting ways in loops and under conditional calls that can respond, for example, to input parameters or intermediate results. This dynamic behavior and great flexibility has made PyTorch a favored experimental platform for deep learning.

Another advantage of PyTorch is that it is closely integrated with the Python language.
For Python coders, it is very intuitive, and it interoperates seamlessly with other Python packages, such as NumPy and SciPy. PyTorch is very easy to experiment with. It makes an ideal tool not only for building and running useful models, but also as a way to understand deep learning principles by direct experimentation.

As you would expect, PyTorch can be run on multiple graphics processing units (GPUs). Deep learning algorithms can be computationally expensive. This is especially true for big datasets. PyTorch has strong GPU support, with intelligent memory sharing of tensors between processes. This basically means there is an efficient and user-friendly way to distribute the processing load across the CPU and GPUs. This can make a big difference to the time it takes to test and run large, complex models.

Dynamic graph generation, tight Python language integration, and a relatively simple API make PyTorch an excellent platform for research and experimentation. However, versions prior to PyTorch 1 had deficits that prevented it from excelling in production environments. This deficiency is being addressed in PyTorch 1.

Research is an important application for deep learning, but increasingly, deep learning is being embedded in applications that run live on the web, on a device, or in a robot. Such an application may service thousands of simultaneous queries and interact with massive, dynamic data. Although Python is one of the best languages for humans to work with, specific efficiencies and optimizations are available in other languages, most commonly C++ and Java. Even though the best way to build a particular deep learning model may be with PyTorch, this may not be the best way to deploy it. This is no longer a problem because now, with PyTorch 1, we can export Python-free representations of PyTorch models.
This has come about through a partnership between Facebook, the major stakeholder of PyTorch, and Microsoft, to create the Open Neural Network Exchange (ONNX) to assist developers in converting neural net models between frameworks. This has led to the merging of PyTorch with the more production-ready framework, Caffe2. In Caffe2, models are represented by a plain text schema, making them language agnostic. This means they are more easily deployed to Android, iOS, or Raspberry Pi devices.

With this in mind, PyTorch version 1 has expanded its API to include production-ready capabilities, such as optimizing code for Android and iPhone, a just in time (JIT) C++ compiler, and several ways to make Python-free representations of your models. In summary, PyTorch has the following characteristics:

Dynamic graph representation
Tightly integrated with the Python programming language
A mix of high- and low-level APIs
Straightforward implementation on multiple GPUs
Able to build Python-free model representations for export and production
Scales to massive data using the Caffe2 framework

Installing PyTorch

PyTorch will run on macOS X, 64-bit Linux, and 64-bit Windows. Be aware that Windows does not currently offer (easy) support for the use of GPUs in PyTorch. You will need to have either Python 2.7 or Python 3.5/3.6 installed on your computer before you install PyTorch, remembering to install the correct version for each Python version. Unless you have a reason not to, it is recommended that you install the Anaconda distribution of Python. This is available from: https://anaconda.org/anaconda/python.

Anaconda includes all the dependencies of PyTorch, as well as technical, math, and scientific libraries essential to your work in deep learning. These will be used throughout the book, so unless you want to install them all separately, install Anaconda. The following is a list of the packages and tools that we will be using in this book.
They are all installed with Anaconda:

NumPy: A math library primarily used for working with multidimensional arrays
Matplotlib: A plotting and visualization library
SciPy: A package for scientific and technical computing
Scikit-Learn: A library for machine learning
Pandas: A library for working with data
IPython: A notebook-style code editor used for writing and running code in a browser

Once you have Anaconda installed, you can now install PyTorch. Go to the PyTorch website at https://pytorch.org/. The installation matrix on this website is pretty self-explanatory. Simply select your operating system, Python version, and, if you have GPUs, your CUDA version, and then run the appropriate command. As always, it is good practice to ensure your operating system and dependent packages are up to date before installing PyTorch. Anaconda and PyTorch run on Windows, Linux, and macOS, although Linux is probably the most used and consistent operating system. Throughout this book, I will be using Python 3.7 and Anaconda 3.6.5 running on Linux. Code in this book was written on the Jupyter Notebook, and these notebooks are available from the book's website.

You can either choose to set up your PyTorch environment locally on your own machine or remotely on a cloud server. They each have their pros and cons. Working locally has the advantage that it is generally easier and quicker to get started. This is especially true if you are not familiar with SSH and the Linux terminal. It is simply a matter of installing Anaconda and PyTorch, and you are on your way. Also, you get to choose and control your own hardware, and while this is an upfront cost, it is often cheaper in the long run. Once you start expanding hardware requirements, cloud solutions can become expensive. Another advantage of working locally is that you can choose and customize your integrated development environment (IDE).
In fact, Anaconda has its own excellent desktop IDE called Spyder.

There are a few things you need to keep in mind when building your own deep learning hardware and you require GPU acceleration:

Use NVIDIA CUDA-compliant GPUs (for example, GTX 1060 or GTX 1080)
A chipset that has at least 16 PCIe lanes
At least 16 GB of RAM

Working on the cloud does offer the flexibility to work from any machine, as well as to more easily experiment with different operating systems, platforms, and hardware. You also have the benefit of being able to share and collaborate more easily. It is generally cheap to get started, costing a few dollars a month, or even free, but as your projects become more complex and data intensive, you will need to pay for more capacity.

Let's look briefly at the installation procedures for two cloud server hosts: Digital Ocean and Amazon Web Services.

Digital Ocean

Digital Ocean offers one of the simplest entry points into cloud computing. It offers predictable, simple payment structures and straightforward server administration. Unfortunately, Digital Ocean does not currently support GPUs. The functionality revolves around droplets, pre-built instances of virtual private servers. The following are the steps required to set up a droplet:

1. Sign up for an account with Digital Ocean. Go to https://www.digitalocean.com/.
2. Click on the Create button and choose New Droplet.
3. Select the Ubuntu distribution of Linux and choose the two gigabyte plan or above.
4. Select the CPU optimization if required. The default values should be fine to get started.
5. Optionally, set up public/private key encryption.
6. Set up an SSH client (for example, PuTTY) using the information contained in the email sent to you.
7. Connect to your droplet via your SSH client and curl the latest Anaconda installer. You can find the address location of the installer for your particular environment at https://repo.continuum.io/.
8. Install PyTorch using this command:

conda install pytorch torchvision -c pytorch

Once you have spun up your droplet, you can access the Linux command line through an SSH client. From the command prompt, you can curl the latest Anaconda installer, available from: https://www.anaconda.com/download/#linux. An installation script is also available from the continuum archive at https://repo.continuum.io/archive/. Full step-by-step instructions are available from the Digital Ocean tutorials section.

Tunneling in to IPython

IPython is an easy and convenient way to edit code through a web browser. If you are working on a desktop computer, you can just launch IPython and point your browser to localhost:8888. This is the port that the IPython server, Jupyter, runs on. However, if you are working on a cloud server, then a common way to work with code is to tunnel in to IPython using SSH. Tunneling in to IPython involves the following steps:

1. In your SSH client, set your destination port to localhost:8888. In PuTTY, go to Connection | SSH | Tunnels.
2. Set the source port to anything above 8000 to avoid conflicting with other services. Click Add. Save these settings and open the connection. Log in to your droplet as usual.
3. Start the IPython server by typing jupyter notebook into the command prompt of your server instance.
4. Access IPython by pointing your browser to localhost:<source port>; for example, localhost:8001.
5. Note that you may need a token to access the server for the first time. This is available from the command output once you start Jupyter. You can either copy the URL given in this output directly into your browser's address bar, changing the port address to your local source port address (for example, 8001), or you can paste the token, the part after token=, into the Jupyter start-up page and replace it with a password for future convenience.
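For readers using a command-line OpenSSH client rather than PuTTY, the tunnel described in the steps above can be sketched as follows. The username, server address, and local port here are placeholders, not values from the book.

```shell
# Forward local port 8001 to the Jupyter server's port 8888 on the droplet.
# Replace user and droplet-ip with your own details.
ssh -L 8001:localhost:8888 user@droplet-ip

# Then, on the server, start the notebook server:
jupyter notebook --no-browser --port=8888

# Finally, point a local browser at localhost:8001.
```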
You now should be able to open, run, and save IPython notebooks.

Amazon Web Services (AWS)

AWS is the original cloud computing platform, most noted for its highly scalable architecture. It offers a vast array of products. What we need to begin is an EC2 instance. This can be accessed from the Services tab of the AWS control panel. From there, select EC2 and then Launch Instance. From here, you can choose the machine image you require. AWS provides several types of machine images specifically for deep learning. Feel free to experiment with any of these, but the one we are going to use here is the deep learning AMI for Ubuntu, version 10. It comes with pre-installed environments for PyTorch and TensorFlow. After selecting this, you get to choose other options. The default T2 micro with 2 GB of memory should be fine to experiment with; however, if you want GPU acceleration, you will need to use the T2 medium instance type. Finally, when you launch your instance, you will be prompted to create and download your public-private key pair. You can then use your SSH client to connect to the server instance and tunnel in to the Jupyter Notebook as per the previous instructions. Once again, check the documentation for the finer details. Amazon has a pay-per-resource model, so it is important you monitor what resources you are using to ensure you do not receive any unnecessary or unexpected charges.

Basic PyTorch operations

Tensors are the workhorse of PyTorch. If you know linear algebra, a two-dimensional tensor is equivalent to a matrix. Torch tensors are effectively an extension of the numpy.array object. Tensors are an essential conceptual component in deep learning systems, so having a good understanding of how they work is important. In our first example, we will be looking at tensors of size 2 x 3. In PyTorch, we can create tensors in the same way that we create NumPy arrays.
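The original book presents this example as a screenshot, which is not reproduced in this extraction. The following is a minimal sketch of what the next passage describes: two 2 x 3 tensors built from nested lists, with a simple linear function applied to them. The particular values of x and y are assumptions, chosen so that the worked arithmetic in the text (F00 = 9, f01 = 12) holds.

```python
import torch

# Two 2 x 3 tensors created from nested lists (values are assumptions).
x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])
y = torch.tensor([[7, 8, 9],
                  [10, 11, 12]])

# A simple linear function, applied elementwise.
f = 2 * x + y
print(f)
# For example, f[0][0] = 2*1 + 7 = 9 and f[0][1] = 2*2 + 8 = 12.
```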
For example, we can pass them nested lists. Here we have created two tensors, each with dimensions of 2 x 3. You can see that we have created a simple linear function (more about linear functions in Chapter 2, Deep Learning Fundamentals), applied it to x and y, and printed out the result.

As you may know from linear algebra, matrix multiplication and addition occur elementwise, so for the first element of x, let's write this as X00. This is multiplied by two and added to the first element of y, written as Y00, giving F00 = 9. X01 = 2 and Y01 = 8, so f01 = 4 + 8 = 12. Notice that the indices start at zero. If you have never seen any linear algebra, don't worry too much about this, as we are going to brush up on these concepts in Chapter 2, Deep Learning Fundamentals, and you will get to practice with Python indexing shortly. For now, just consider our 2 x 3 tensors as tables with numbers in them.

Default value initialization

There are many cases where we need to initialize torch tensors to default values. Here, we create three 2 x 3 tensors, filling them with zeros, ones, and random floating point numbers.

An important point to consider when we are initializing random arrays is the so-called seed of reproducibility. See what happens when you run the random initialization several times. You get a different array of random numbers each time. Often in machine learning, we need to be able to reproduce results. We can achieve this by using a random seed. Notice that when you seed the generator, the tensor values stay the same however many times you run the code. If you remove the seed, the tensor values will be different each time the code is run.
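The code figures for this subsection are likewise missing from the extraction. A minimal sketch of the default initializations and the seeding just described (the seed value 42 is an arbitrary assumption):

```python
import torch

# Three 2 x 3 tensors with default initializations.
x_zeros = torch.zeros(2, 3)  # all elements 0.0
x_ones = torch.ones(2, 3)    # all elements 1.0
x_rand = torch.rand(2, 3)    # uniform random floats in [0, 1)

# Seeding the random number generator makes the "random" values
# reproducible: re-running these two lines always gives the same tensor.
torch.manual_seed(42)
x_seeded = torch.rand(2, 3)
print(x_seeded)
```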
It does not matter what number you use to seed the random number generator; as long as it is used consistently, you will achieve reproducible results.

Converting between tensors and NumPy arrays
Converting a NumPy array is as simple as performing an operation on it with a torch tensor. We can see that the result is of type torch tensor. In many cases, we can use NumPy arrays interchangeably with tensors and always be sure the result is a tensor. However, there are times when we need to explicitly create a tensor from an array. This is done with the torch.from_numpy function. To convert from a tensor to a NumPy array, simply call the tensor's numpy() method. Notice that we use Python's built-in type() function, as in type(object), rather than the tensor.type() we used previously; NumPy arrays do not have a type attribute. Another important thing to understand is that NumPy arrays and PyTorch tensors share the same memory space, so changing a value through one variable changes it in the other. Note also that when we print a tensor created from an array, the output includes its dtype, or data type, attribute. This is important because certain dtype arrays cannot be turned into tensors. For example, passing an int8 array generates an error message telling us that only supported dtypes can be converted into tensors; clearly, int8 is not one of the supported types in this version of PyTorch. We can fix this by converting our int8 array to an int64 array before passing it to torch.from_numpy. We do this with NumPy's astype method. It is also important to understand how NumPy dtypes convert to torch dtypes; in the previous example, NumPy int32 converts to an IntTensor.
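A sketch of the round trip between NumPy arrays and tensors, including the shared-memory behavior (the array values are arbitrary):

```python
import numpy as np
import torch

# Explicit conversion from a NumPy array to a tensor.
arr = np.array([[1, 2, 3], [4, 5, 6]])
t = torch.from_numpy(arr)

# ...and back again; note the tensor method, not a torch-level function.
back = t.numpy()

# The array and the tensor share memory, so mutating one mutates the other.
arr[0, 0] = 99
print(t[0, 0].item())  # 99

# Unsupported NumPy dtypes can be converted first with astype.
small = np.arange(4, dtype=np.int8)
t2 = torch.from_numpy(small.astype(np.int64))
print(t2.dtype)  # torch.int64
```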
The following table lists the torch dtypes and their NumPy equivalents:

NumPy type         Torch dtype                    Torch type     Description
int64              torch.int64 / torch.long       LongTensor     64-bit signed integer
int32              torch.int32 / torch.int        IntTensor      32-bit signed integer
uint8              torch.uint8                    ByteTensor     8-bit unsigned integer
float64 / double   torch.float64 / torch.double   DoubleTensor   64-bit floating point
float32            torch.float32 / torch.float    FloatTensor    32-bit floating point
int16              torch.int16 / torch.short      ShortTensor    16-bit signed integer
int8               torch.int8                     CharTensor     8-bit signed integer

The default type for tensors is FloatTensor; however, we can specify a particular data type by using the tensor's dtype attribute.

Slicing, indexing, and reshaping
torch.Tensor has most of the attributes and functionality of NumPy. For example, we can slice and index tensors in the same way as NumPy arrays. Here, we have printed out the first element of x, written as x0, and in the second example, we have printed out a slice of the second element of x; in this case, x10 and x11. If you have not come across slicing and indexing, you may want to look at this again. Note that indexing begins at 0, not 1, and we have kept our subscript notation consistent with this. Notice also that the slice [1][0:2] is the elements x10 and x11; it excludes the ending index, index 2, specified in the slice. We can create a reshaped view of an existing tensor using the view() function. It is pretty clear what view(3, 2) and view(6, 1) do, but what about a -1 argument? This is useful if you know how many columns you require but do not know how many rows this will fit into; indicating -1 tells PyTorch to calculate the number of rows required. Using it without another dimension simply flattens the tensor into a single row.
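A sketch of the slicing and reshaping operations just described (the values are the same assumed 2 x 3 tensor as before):

```python
import torch

x = torch.tensor([[1, 2, 3], [4, 5, 6]])

print(x[0])        # first row: tensor([1, 2, 3])
print(x[1][0:2])   # elements x10 and x11: tensor([4, 5])

# view() reshapes without copying the underlying data.
print(x.view(3, 2))
print(x.view(6, 1))
print(x.view(-1))     # flattens to a single row of six elements
print(x.view(3, -1))  # -1 tells PyTorch to infer that dimension
```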
You could rewrite the second example as x.view(3, -1) if you did not know the input tensor's shape but knew that it needs to have three rows. An important operation is swapping axes, or transposing. For a two-dimensional tensor, we can use tensor.transpose(), passing it the two axes we want to swap. In this example, the original 2 x 3 tensor becomes a 3 x 2 tensor; the rows simply become the columns. In PyTorch, transpose() can only swap two axes at once. We could apply transpose in multiple steps; however, a more convenient way is to use permute(), passing it the ordering of axes we want. When we are considering tensors in two dimensions, we can visualize them as flat tables. When we move to higher dimensions, this visual representation becomes impossible; we simply run out of spatial dimensions. Part of the magic of deep learning is that, in terms of the mathematics involved, this does not matter much. Real-world features are each encoded into a dimension of a data structure, so we may be dealing with tensors of potentially thousands of dimensions. Although it might be disconcerting, most of the ideas that can be illustrated in two or three dimensions work just as well in higher dimensions.

In-place operations
It is important to understand the difference between in-place and assignment operations. When, for example, we use x.transpose(0, 1), a transposed tensor is returned, but the value of x itself does not change. In all the examples up until now, we have been performing operations by assignment; that is, we have been assigning a variable to the result of an operation, or simply printing it to the output. In either case, the original variable remains untouched. Alternatively, we may need to apply an operation in place. We can, of course, assign a variable to itself, such as in x = x.transpose(0, 1); however, a more convenient way to do this is with in-place operations.
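A sketch of these axis operations, together with the in-place style discussed next (PyTorch marks in-place operations with a trailing underscore; the specific tensors here are assumed values, not the book's listing):

```python
import torch

x = torch.tensor([[1, 2, 3], [4, 5, 6]])

# transpose() swaps exactly two axes: 2 x 3 becomes 3 x 2.
print(x.transpose(0, 1).shape)  # torch.Size([3, 2])

# permute() reorders any number of axes in a single call.
t = torch.zeros(2, 3, 4)
print(t.permute(2, 0, 1).shape)  # torch.Size([4, 2, 3])

# In-place operations end with an underscore and modify the tensor itself.
y = torch.ones(2, 3)
y.add_(2 * x)   # the linear function f = 2x + y, computed in place on y
print(y)
```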
In general, in-place operations in PyTorch have a trailing underscore; for example, the linear function we started this chapter with can be computed using in-place operations on y.

Loading data
Most of the time you spend on a deep learning project will be spent working with data, and one of the main reasons that a deep learning project fails is bad, or poorly understood, data. This issue is often overlooked when we are working with well-known and well-constructed datasets, where the focus is on learning the models. The algorithms that make deep learning models work are complex enough themselves, without this complexity being compounded by something that is only partially known, such as an unfamiliar dataset. Real-world data is noisy, incomplete, and error prone. These axes of confoundedness mean that if a deep learning algorithm is not giving sensible results, then after errors of logic in the code are eliminated, bad data, or errors in our understanding of the data, are the likely culprits. So, putting aside our wrestle with data, and with an understanding that deep learning can provide valuable real-world insights, how do we learn deep learning? Our starting point is to eliminate as many of the variables as we can. This can be achieved by using data that is well known and representative of a specific problem; say, for example, classification. This gives us both a starting point for deep learning tasks and a standard against which to test model ideas. One of the most well-known datasets is the MNIST dataset of handwritten digits, where the usual task is to correctly classify each of the digits, from zero through nine. The best models achieve an error rate of around 0.2%. We could apply such a well-performing model, with a few adjustments, to any visual classification task, with varying results.
It is unlikely we would get results anywhere near 0.2%, and the reason is that the data is different. Understanding how to tweak a deep learning model to take into account these sometimes subtle differences in data is one of the key skills of a successful deep learning practitioner. Consider an image classification task of facial recognition from color photographs. The task is still classification, but differences in the data type and structure dictate how the model will need to change to take this into account. How this is done is at the heart of machine learning. For example, if we are working with color images, as opposed to black and white images, we will need two extra input channels. We will also need output channels for each of the possible classes: in a handwriting classification task, we need 10 output channels, one channel for each of the digits; for a facial recognition task, we would consider having an output channel for each target face (say, for criminals in a police database). Clearly, an important consideration is data types and structures. The way data is structured in an image is vastly different from that of, say, an audio signal or the output of a medical device. What if we are trying to classify people's names by the sound of their voice, or classify a disease by its symptoms? These are all classification tasks; however, in each specific case, the models that represent them will be vastly different. In order to build suitable models in each case, we will need to become intimately acquainted with the data we are using. It is beyond the scope of this book to discuss the nuances and subtleties of each data type, format, and structure, but we can give you a brief insight into the tools, techniques, and best practices of data handling in PyTorch. Deep learning datasets are often very large, so how they are handled in memory is an important consideration.
We need to be able to transform data, output data in batches, shuffle data, and perform many other operations on data before we feed it to a model. We need to be able to do all these things without loading the entire dataset into memory, since many datasets are simply too large. PyTorch takes an object-oriented approach to working with data, creating class objects for each specific activity. We will examine this in more detail in the coming sections.

PyTorch dataset loaders
PyTorch includes data loaders for several datasets to help you get started. These datasets live in the torchvision.datasets package, and torch.utils.data.DataLoader is the class used for loading them. The following is a list of the included datasets with a brief description of each:

MNIST: Handwritten digits, zero through nine. A subset of the NIST dataset of handwritten characters. Contains a training set of 60,000 images and a test set of 10,000.
FashionMNIST: A drop-in replacement for MNIST. Contains images of fashion items; for example, T-shirt, trousers, and pullover.
EMNIST: Based on NIST handwritten characters, including letters and numbers, and split for 47-, 26-, and 10-class classification problems.
COCO: Over 100,000 images classified into everyday objects; for example, person, backpack, and bicycle. Each image can have more than one class.
LSUN: Used for large-scale scene classification of images; for example, bedroom, bridge, and church.
Imagenet-12: Large-scale visual recognition dataset containing 1.2 million images and 1,000 categories. Implemented with the ImageFolder class, where each class is in a folder.
CIFAR: 60,000 low-resolution (32 x 32) color images in 10 mutually exclusive classes; for example, airplane, truck, and car.
STL10: Similar to CIFAR, but with a higher resolution and a larger number of unlabeled images.
SVHN: 600,000 images of street numbers obtained from Google Street View. Used for recognition of digits in real-world settings.
PhotoTour: Used for learning local image descriptors.
It consists of grayscale images composed of 126 patches, accompanied by a descriptor text file, and is used for pattern recognition. Here is a typical example of how we load one of these datasets into PyTorch. CIFAR10 is a torch.utils.data.Dataset object. Here, we pass it four arguments: a root directory relative to where the code is running; a Boolean, train, indicating whether we want the test or training set loaded; a Boolean, download, that, if set to True, will check whether the dataset has previously been downloaded and, if not, download it; and a callable transform. In this case, the transform we select is ToTensor(). This is an inbuilt class of torchvision.transforms that makes the dataset return a tensor. We will discuss transforms in more detail later in the chapter. The contents of the dataset can be retrieved by a simple index lookup. We can also check the length of the entire dataset with the len function, and we can loop through the dataset in order.

Displaying an image
The CIFAR10 dataset object returns a tuple containing an image object and a number representing the label of the image. We see from the size of the image data that each sample is a 3 x 32 x 32 tensor, representing three color values for each of the 32 x 32 = 1,024 pixels in the image. It is important to know that this is not quite the same format used by matplotlib. A tensor stores an image in the format [color, height, width], whereas a NumPy image uses the format [height, width, color]. To plot an image, we need to swap axes using the permute() function, or alternatively convert it to a NumPy array and use the transpose function. Note that we do not need to convert the image to a NumPy array, as matplotlib will display a correctly permuted tensor.
DataLoader
In a deep learning model, we may not always want to load images one at a time or load them in the same order each time. For this, and other reasons, it is often better to use the torch.utils.data.DataLoader object. DataLoader provides a multipurpose iterator to sample the data in a specified way, such as in batches, or shuffled. It is also a convenient place to assign workers in multiprocessor environments. In the following example, we sample the dataset in batches of four samples each. Here, DataLoader returns a tuple of two tensors: the first tensor contains the image data of all four images in the batch, and the second tensor contains the images' labels. Each batch consists of four image-label pairs, or samples. Calling next() on the iterator generates the next set of four samples. In machine learning terminology, each pass over the entire dataset is called an epoch. This technique is used extensively, as we will see, to train and test deep learning models.

Creating a custom dataset
The Dataset class is an abstract class representing a dataset. Its purpose is to have a consistent way of representing the specific characteristics of a dataset. When we are working with unfamiliar datasets, creating a Dataset object is a good way to understand and represent the structure of the data. It is used with a data loader class to draw samples from a dataset in a clean and efficient manner. Common actions we perform with a Dataset class include checking the data for consistency, applying transform methods, dividing the data into training and test sets, and loading individual samples. In the following example, we use a small toy dataset consisting of images of objects that are classified as either toys or not toys.
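A sketch of this batching behavior, using a stand-in dataset of random tensors (any Dataset object works the same way with DataLoader):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A stand-in dataset of twelve random "images" with integer labels.
images = torch.rand(12, 3, 32, 32)
labels = torch.randint(0, 10, (12,))
dataset = TensorDataset(images, labels)

# Sample in shuffled batches of four.
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_images, batch_labels = next(iter(loader))
print(batch_images.shape)  # torch.Size([4, 3, 32, 32])
print(batch_labels.shape)  # torch.Size([4])

# One full pass over the loader (three batches here) is one epoch.
```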
This is representative of a simple image classification problem, where a model is trained on a set of labeled images. A deep learning model needs the data with various transformations applied in a consistent manner; samples may need to be drawn in batches and the dataset shuffled. Having a framework for representing these data tasks greatly simplifies and enhances deep learning models. The complete dataset is available at http://www.vision.caltech.edu/pmoreels/Datasets/Giuseppe_Toys_03/. For this example, I have created a smaller subset of the dataset, together with a labels.csv file. This is available in the data/GiuseppeToys folder in the GitHub repository for this book. The __init__ function is where we initialize all the properties of the class. Since it is only called once, when we first create the instance, we perform all the housekeeping functions here, such as reading CSV files, setting variables, and checking data for consistency. We only perform operations that occur across the entire dataset, so we do not load the payload (in this example, an image); instead, we make sure that the critical information about the dataset, such as directory paths, filenames, and dataset labels, is stored in variables. The __len__ function simply allows us to call Python's built-in len() function on the dataset. Here, we return the length of the list of label tuples, indicating the number of images in the dataset. We want to make sure that it stays as simple and reliable as possible, because we depend on it to correctly iterate through the dataset. The __getitem__ function is a special Python method that we override in our Dataset class definition. This gives the Dataset class the functionality of Python sequence types, such as indexing and slicing.
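Putting these pieces together, here is a sketch of such a Dataset class. The class name, file layout, and labels.csv format are my assumptions, not the book's exact code:

```python
import csv
import os
from PIL import Image
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    """A sketch of the toy image dataset described in the text.
    The labels.csv layout (filename, label) is an assumption."""

    def __init__(self, root, transform=None):
        self.root = root
        self.transform = transform
        # Housekeeping done once: read the CSV of (filename, label) pairs.
        with open(os.path.join(root, "labels.csv")) as f:
            self.samples = [(row[0], int(row[1])) for row in csv.reader(f)]

    def __len__(self):
        # Number of (filename, label) pairs.
        return len(self.samples)

    def __getitem__(self, index):
        # Called on every index lookup: load just this one image.
        filename, label = self.samples[index]
        image = Image.open(os.path.join(self.root, "images", filename))
        if self.transform is not None:
            image = self.transform(image)
        return image, label
```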
This method gets called often (every time we do an index lookup), so make sure it only does what it needs to do to retrieve the sample. To harness this functionality for our own data, we need to create an instance of our custom dataset.

Transforms
As well as the ToTensor() transform, the torchvision package includes a number of transforms specifically for Python Imaging Library (PIL) images. We can apply multiple transforms to a dataset object using the Compose class. Compose objects are essentially a list of transforms that can then be passed to the dataset as a single variable. It is important to note that the image transforms can only be applied to PIL image data, not to tensors. Since transforms in a Compose are applied in the order that they are listed, it is important that the ToTensor transform occurs last; if it is placed before the PIL transforms in the Compose list, an error will be generated. Finally, we can check that it all works by using DataLoader to load a batch of transformed images, as we did before.

ImageFolder
We can see that the main function of the dataset object is to take a sample from a dataset, and the function of DataLoader is to deliver a sample, or a batch of samples, to a deep learning model for evaluation. One of the main things to consider when writing our own dataset object is how to build a data structure in accessible memory from data that is organized in files on a disk. A common way to organize data is in folders named by class. Let's say that, for this example, we have three folders named toys, notoys, and scenes, contained in a parent folder, images. Each of these folders represents the label of the files contained within it. We need to be able to load the images while retaining their separate labels. Happily, there is a class for this, and like most things in PyTorch, it is very easy to use.
The class is torchvision.datasets.ImageFolder. Within the data/GiuseppeToys/images folder, there are three folders, toys, notoys, and scenes, containing images, with the folder names indicating labels. Notice that the labels retrieved using DataLoader are represented by integers. Since, in this example, we have three folders, representing three labels, DataLoader returns the integers 0 to 2, representing the image labels.

Concatenating datasets
It is clear that the need will arise to join datasets; we can do this with the torch.utils.data.ConcatDataset class. ConcatDataset takes a list of datasets and returns a concatenated dataset. In the following example, we add two more transforms, removing the blue and green color channels. We then create two more dataset objects, applying these transforms, and, finally, concatenate all three datasets into one.

Summary
In this chapter, we have introduced some of the features and operations of PyTorch. We gave an overview of the installation platforms and procedures. You have hopefully gained some knowledge of tensor operations and how to perform them in PyTorch. You should be clear about the distinction between in-place and assignment operations, and should also now understand the fundamentals of indexing and slicing tensors. In the second half of this chapter, we looked at loading data into PyTorch. We discussed the importance of data and how to create a dataset object to represent custom datasets. We looked at the inbuilt data loaders in PyTorch and discussed representing data in folders using the ImageFolder object. Finally, we looked at how to concatenate datasets. In the next chapter, we will take a whirlwind tour of deep learning fundamentals and their place in the machine learning landscape.
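A sketch of the concatenation itself, using stand-in datasets of random tensors (the channel-removing transforms are omitted here; any datasets concatenate the same way):

```python
import torch
from torch.utils.data import ConcatDataset, TensorDataset

# Three stand-in datasets of random "images".
d1 = TensorDataset(torch.rand(10, 3, 32, 32))
d2 = TensorDataset(torch.rand(5, 3, 32, 32))
d3 = TensorDataset(torch.rand(7, 3, 32, 32))

# ConcatDataset chains them into one indexable dataset.
combined = ConcatDataset([d1, d2, d3])
print(len(combined))  # 22
```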
We will get you up to speed with the mathematical concepts involved, including looking at linear systems and common techniques for solving them.

2 Deep Learning Fundamentals
Deep learning is generally considered a subset of machine learning, involving the training of artificial neural networks (ANNs). ANNs are at the forefront of machine learning. They have the ability to solve complex problems involving massive amounts of data. Many of the principles of machine learning in general are also important in deep learning specifically, so we will spend some time reviewing them here. In this chapter, we will discuss the following topics:

Approaches to machine learning
Learning tasks
Features
Models
Artificial neural networks

Approaches to machine learning
Before machine learning, if we wanted to build, for example, a spam filter, we could start by compiling a list of words that commonly appear in spam. The spam detector then scans each email, and when the number of blacklisted words reaches a threshold, the email is classified as spam. This is called a rules-based approach. The problem with this approach is that once the writers of spam know the rules, they can craft emails that avoid the filter. The people with the unenviable task of maintaining this spam filter would have to continually update the list of rules. With machine learning, we can effectively automate this rule-updating process. Instead of writing a list of rules, we build and train a model. As a spam detector, it will be more accurate, since it can analyze large volumes of data. It is able to detect patterns in data that would be impossible for a human to find in a meaningful timeframe.
There are a large number of ways that we can approach machine learning, and these approaches are broadly characterized by the following factors:

Whether or not models are trained with labeled training data. There are several possibilities here, including entirely supervised, semi-supervised, reinforcement-based, and entirely unsupervised.
Whether they are online (that is, learning on the fly as new data is presented), or learn using pre-existing data, which is referred to as batch learning.
Whether they are instance-based, simply comparing new data to known data, or model-based, involving the detection of patterns and the building of a predictive model.

These approaches are not mutually exclusive, and most algorithms are a combination of them. For example, a typical way to build a spam detector is to use an online, model-based, supervised learning algorithm.

Learning tasks
There are several distinct types of learning tasks that are partially defined by the type of data that they work on. Based on this, we can divide learning tasks into two broad categories:

Unsupervised learning: Data is unlabeled, so the algorithm must infer relationships between variables, for example by finding clusters of similar variables.
Supervised learning: Uses a labeled dataset to build an inferred function that can be used to predict the label of an unlabeled sample.

Whether or not the data is labeled has a predetermining effect on the way a learning algorithm is built.

Unsupervised learning
One of the main drawbacks of supervised learning is that it requires data that is accurately labeled. Most real-world data consists of unlabeled, unstructured data, and this is a major challenge for machine learning and the broader endeavor of artificial intelligence. Unsupervised learning plays an important role in finding structure in unstructured data.
The division between supervised and unsupervised learning is not absolute. Many unsupervised algorithms are used together with supervised learning; for example, where data is only partially labeled, or when we are trying to find the most important features of a deep learning model.

Clustering
This is the most straightforward unsupervised method. In many cases, it does not matter that the data is unlabeled; what we are interested in is the fact that the data clusters around certain points. Recommender systems that, say, recommend movies or books from an online store often use clustering techniques. An approach here is for an algorithm to analyze a customer's purchase history, compare it to those of other customers, and make recommendations based on similarities. The algorithm clusters customers' usage patterns into groups. At no time does the algorithm know what the groups are; it works this out for itself. One of the most used clustering algorithms is k-means. This algorithm works by establishing cluster centers based on the means of the observed samples assigned to them.

Principal component analysis
Another unsupervised method, often used in conjunction with supervised learning, is principal component analysis (PCA). This is used when we have a large number of features that may be correlated and we are unsure of the impact each feature has in determining a result. For example, in weather prediction, we could use each meteorological observation as a feature and feed them all directly to a model. This means the model would have to analyze a large amount of data, much of it irrelevant. Further, the data may be correlated, so we need to consider not just individual features but how those features interact with each other. What we need is a tool that will reduce this large number of possibly correlated and redundant features to a small number of principal components.
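To make the clustering idea concrete, here is a minimal k-means sketch in PyTorch. The two-blob data and the initialization are my own toy choices, not an example from the book:

```python
import torch

torch.manual_seed(0)
# Two well-separated blobs of 2-D points.
data = torch.cat([torch.randn(50, 2) + 5, torch.randn(50, 2) - 5])

# k-means: alternately assign each point to its nearest center, then
# recompute each center as the mean of the points assigned to it.
centers = torch.stack([data[0], data[50]])  # one starting point per blob
for _ in range(10):
    assign = torch.cdist(data, centers).argmin(dim=1)
    centers = torch.stack([data[assign == k].mean(dim=0) for k in range(2)])

print(centers)  # ends up close to the blob centers, roughly (5, 5) and (-5, -5)
```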
PCA belongs to a class of algorithms called dimensionality reduction, because it reduces the number of dimensions in the input dataset.

Reinforcement learning
Reinforcement learning is somewhat different from the other methods, and is often classified as an unsupervised method because the data it uses is not labeled in the supervised sense. Reinforcement learning probably comes closer than other methods to the way humans interact with and learn from the world. In reinforcement learning, the learning system is called an agent, and this agent interacts with an environment through observation and by performing actions. Each action results in either a reward or a penalty, and the agent must develop a strategy, or policy, to maximize rewards and minimize penalties over time. Reinforcement learning has applications in many domains, such as game theory and robotics, where the algorithm must learn its environment without direct human prompting.

Supervised learning
In supervised learning, a machine learning model is trained on a labeled dataset. Most successful deep learning models so far have focused on supervised learning tasks. With supervised learning, each data instance (say, an image or an email) comes with two elements: a set of features, usually denoted by an uppercase X, and a label, denoted by a lowercase y. Sometimes, the label is called the target or the answer. Supervised learning is usually conducted in two stages: a training phase, in which the model learns the characteristics of the data, and a testing phase, in which predictions are made on unlabeled data. It is important that the model is trained and tested on separate datasets, since the goal is to generalize to new data, not to precisely learn the characteristics of a single dataset. Learning a single dataset too precisely leads to the common problem of overfitting the training set and consequently performing poorly on the test set.
Classification
Classification is probably the most common supervised machine learning task. There are several types of classification problems, based on the numbers of input and output labels. The task of a classification model is to find a pattern in the input features and associate this pattern with a label. A model should learn the distinguishing features of the data and then be able to predict the label of an unlabeled sample. The model essentially builds an inferred function from the training data. We will look at how this function is built shortly. We can distinguish three types of classification models:

Binary classification: As in our toy/no-toy example, this involves distinguishing between two labels.
Multi-class classification: Involves distinguishing between more than two classes; for example, if the toy example were extended to distinguish between the types of toy in the image (car, truck, plane, and so on). A common way to solve multi-class classification problems is to divide the problem into multiple binary problems.
Multiple output classification: Each sample may have more than one output label. For example, perhaps the task is to analyze images of scenes and determine what types of toys are in them; each image can contain multiple types of toys and therefore has multiple labels.

Evaluating classifiers
You may think that the best way to measure the performance of a classifier is to count the proportion of successful predictions out of the total predictions made. However, consider a classification task on a dataset of handwritten digits, where the target is all the digits that are not 7. Just guessing that every sample is not 7 will give a success rate, assuming the data is evenly distributed, of 90%.
When evaluating classifiers, we must consider four quantities:

TP (true positives): Predictions that correctly identify a target
TN (true negatives): Predictions that correctly identify a non-target
FP (false positives): Predictions that incorrectly identify a non-target as a target
FN (false negatives): Predictions that incorrectly identify a target as a non-target

Two metrics, precision and recall, are commonly used together to measure the performance of a classifier. Precision is defined by the following equation:

precision = TP / (TP + FP)

Recall is defined by the following equation:

recall = TP / (TP + FN)

We can combine these ideas in what is known as a confusion matrix. It is called a confusion matrix not because it is confusing to understand, but because it tabulates the instances where the classifier confuses targets:

                     Predicted target    Predicted non-target
Actual target              TP                   FN
Actual non-target          FP                   TN

Which measure we use, or give more weight to in determining the success of a classifier, really depends on the application. There is a trade-off between precision and recall: improving one will often result in a reduction of the other. For example, increasing the number of predicted positives to capture more true positives will often increase the number of false positives as well. The right balance of precision and recall depends on the requirements of the application. In a medical test for cancer, for example, we probably need high recall, since a false negative means an instance of cancer remains undiagnosed, with potentially fatal consequences.

Features
It is important to remember that an image detection model does not see an image, but a set of pixel color values; in the case of a spam filter, it sees a collection of characters in an email. These are the raw features of the model. An important part of machine learning is feature transformation. One feature transformation we have already discussed is dimensionality reduction, in the form of principal component analysis.
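Returning to classifier evaluation, the precision and recall formulas above can be sketched directly (the counts here are made-up numbers for illustration):

```python
# Precision and recall from the four evaluation counts.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of everything predicted as a target, how much was right
    recall = tp / (tp + fn)     # of all actual targets, how many were found
    return precision, recall

# Made-up counts: 8 true positives, 2 false positives, 4 false negatives.
p, r = precision_recall(tp=8, fp=2, fn=4)
print(p)  # 0.8
print(r)  # 8/12 = 0.666...
```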
The following is a list of common feature transformations:

Dimensionality reduction, to reduce the number of features using techniques such as PCA
Scaling or normalizing features to be within a particular numerical range
Transforming the feature data type (for example, assigning categories to numbers)
Adding random or generated data to augment features

Each feature is encoded on to a dimension of our input tensor, X, so in order to make a learning model as efficient as possible, the number of features needs to be minimized. This is where principal component analysis and other dimensionality reduction techniques come into play. Another important feature transformation is scaling. Most machine learning models do not perform well when features are of different scales. There are two common techniques used for feature scaling:

Normalization or min-max scaling: Values are shifted and re-scaled to be between zero and one. This is the most used scaling method for neural networks.

Standardization: Subtracts the mean and divides by the standard deviation. This does not bound variables to a particular range, but the resultant distribution has unit variance.

Handling text and categories

What do we do when a feature is a set of categories rather than a number? Suppose we are building a model to predict house prices. A feature of this model could be the cladding material of the house, with possible values such as timber, iron, and cement. How can we encode this feature to be of use to a deep learning model? The obvious solution is to simply assign a real number to each category: say, 1 for timber, 2 for iron, and 3 for cement. The problem with this representation is that it implies that the category values are ordered. That is, timber and iron are somehow closer than timber and cement. A solution that avoids this is one-hot encoding.
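The two scaling techniques can be sketched in a few lines of plain Python; the feature values below are made up for illustration:

```python
def min_max_scale(values):
    """Normalization: shift and re-scale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: subtract the mean and divide by the standard deviation."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    sd = var ** 0.5
    return [(v - m) / sd for v in values]

sizes = [50.0, 80.0, 110.0, 200.0]  # for example, floor space in square meters
print(min_max_scale(sizes))  # values now lie between 0 and 1
print(standardize(sizes))    # values now have zero mean and unit variance
```

Note that min-max scaling bounds the output to [0, 1], while standardization does not bound it at all; which is preferable depends on the model being fed.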
The feature values are encoded as binary vectors, as shown in the following table:

Timber  Iron  Cement
  1      0      0
  0      1      0
  0      0      1

This solution works well when the number of category values is small. If, for example, the data is a corpus of text, and our task is natural language processing, one-hot encoding is not practical. The number of category values, and therefore the length of the feature vector, is the number of words in the vocabulary. In this case, the feature vector becomes large and unmanageable.

One-hot encoding uses what is called a sparse representation: most of the values are 0. As well as not scaling very well, one-hot encoding has another serious drawback for natural language processing. It does not encode a word's meaning, or its relationship to other words. An approach we can use instead is called dense word embedding. Each word in a vocabulary is represented by a real-valued vector, where each element represents a score for a particular attribute. The general idea is that this vector encodes semantic information relevant to the task at hand. For example, if the task is to analyze movie reviews and determine the genre of a movie based on a review, we could create word embeddings, as shown in the following table:

               Funny   Action   Suspense
Drama           -4      3.5       4.5
Comedy          4.5     2.5       1.5
Documentary      0       2         3

Here, the top row lists words that may be present in a movie review. Each word is given a score relative to how often it appears in reviews of the respective genre. We could build such a table from a supervised learning task that analyzes movie reviews in conjunction with their labeled genres. This trained model could then be applied to unlabeled reviews to determine the most likely genre.

Models

Choosing a model representation is an important task in machine learning. So far, we have been referring to models as black boxes. Some data is put in, and, based on training, the model makes a prediction.
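A one-hot encoder for the cladding feature can be sketched in a couple of lines; the category list mirrors the table above:

```python
def one_hot(value, categories):
    """Encode a categorical value as a binary vector with a single 1."""
    return [1 if c == value else 0 for c in categories]

cladding = ["timber", "iron", "cement"]
print(one_hot("timber", cladding))  # [1, 0, 0]
print(one_hot("iron", cladding))    # [0, 1, 0]
```

Since each vector is equidistant from every other, no spurious ordering between categories is implied.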
Before we look inside this black box, let's review some of the linear algebra that we will need to understand deep learning models.

Linear algebra review

Linear algebra is concerned with the representation of linear equations through the use of matrices. In the algebra taught in high school, we were concerned with scalar, that is, single number, values. We have equations, and rules for manipulating these equations, so that they can be evaluated. The same is true when, instead of scalar values, we use matrices. Let's review some of the concepts involved.

A matrix is simply a rectangular array of numbers. We add two matrices simply by adding each corresponding element, as shown in the following example:

[1 2; 3 4; 5 6] + [7 8; 9 10; 11 12] = [8 10; 12 14; 16 18]

This is an example of matrix addition and, as you would expect, you can perform matrix subtraction in the same way, except, of course, rather than adding corresponding elements, you subtract them. Note that we can only add or subtract matrices of the same size. Another common matrix operation is multiplication by a scalar, where we simply multiply every element in the array by the scalar:

2 * [1 2; 3 4; 5 6] = [2 4; 6 8; 10 12]

Notice the indexing style we use: Xij, where i refers to the row and j refers to the column. There are two conventions when it comes to indexing. Here, I am using zero indexing; that is, indexing starts at zero. This is to keep it consistent with the way we index tensors in PyTorch. Be aware that in some mathematical texts, and depending on what programming language you use, indexing may start at 1. Also, we refer to the size, or the dimension, of a matrix as m by n, where m is the number of rows and n is the number of columns. For example, the matrices A and B in the addition example above are both 3 x 2 matrices. There is a special case of a matrix called a vector.
This is simply an n by 1 matrix; that is, it has one column and any number of rows, as in the following example of a size 3 vector:

x = [x0; x1; x2]

Let's now look at how to multiply a vector with a matrix. In the following example, we multiply a 3 x 2 matrix with a vector of size 2:

[a00 a01; a10 a11; a20 a21] [x0; x1] = [a00*x0 + a01*x1; a10*x0 + a11*x1; a20*x0 + a21*x1]

A concrete example may make this clearer:

[1 2; 3 4; 5 6] [2; 1] = [4; 10; 16]

Note that here, the 3 x 2 matrix multiplied by a vector of size 2 results in a vector of size 3 and, in general, an m-row matrix multiplied by a vector will result in an m-sized vector. We can also multiply matrices with other matrices by combining matrix-vector multiplications. One way of understanding this is that we obtain the first column of matrix C by multiplying matrix A with a vector comprising the first column of matrix B. We obtain the second column of matrix C by multiplying matrix A with a vector obtained from the second column of matrix B, and so on. Let's look at a concrete example:

[1 2; 3 4; 5 6] [1 0; 2 1] = [5 2; 11 4; 17 6]

It is important to understand that we can only multiply two matrices if the number of columns in A is equal to the number of rows in B. The resultant matrix will always have the same number of rows as A and the same number of columns as B. Note that matrix multiplication is not commutative; in general:

AB ≠ BA

Matrix multiplication is, however, associative:

(AB)C = A(BC)

Matrices are useful because we can represent a large number of operations with relatively simple equations. There are two matrix operations that are particularly important for machine learning:

Transpose
Inverse

To transpose a matrix, we simply swap the columns and rows, as shown in the following example:

[1 2; 3 4; 5 6]^T = [1 3 5; 2 4 6]

Finding the inverse of a matrix is a little more complicated. In the set of real numbers, the number 1 plays the role of the identity. That is to say, the number 1 multiplied by any other number equals that number.
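The column-by-column view of matrix multiplication can be checked in plain Python. This sketch multiplies a 3 x 2 matrix by a 2 x 2 matrix, representing matrices as lists of rows:

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows; columns of A must equal rows of B."""
    assert len(A[0]) == len(B), "columns of A must equal rows of B"
    rows, cols, inner = len(A), len(B[0]), len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

A = [[1, 2], [3, 4], [5, 6]]   # 3 x 2
B = [[1, 0], [2, 1]]           # 2 x 2
print(matmul(A, B))  # [[5, 2], [11, 4], [17, 6]], a 3 x 2 result
```

As the dimension rule predicts, the 3 x 2 matrix times the 2 x 2 matrix gives a 3 x 2 result: the rows of A and the columns of B.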
Also, almost every real number has an inverse; that is, a number that, when multiplied by the original number, equals 1. For example, the inverse of 2 is 0.5, because 2 times 0.5 equals 1. It turns out that an equivalent idea holds for matrices and tensors. The identity matrix consists of 1s along its main diagonal and 0s everywhere else, as shown in the following 3 x 3 example:

I = [1 0 0; 0 1 0; 0 0 1]

The identity matrix is the result when we multiply a matrix by its inverse. We write this in the following way:

A A^-1 = A^-1 A = I

Importantly, we can only find the inverse of a square matrix. You are not expected to calculate inverse matrices, or indeed perform any matrix operation, by hand. That is what computers are good at. Inverting a matrix is a non-trivial operation and is, even for a computer, computationally expensive.

Linear models

The simplest models we will encounter in machine learning are linear models. Solving linear models is important in many different settings, and they form the building blocks of many nonlinear techniques. With a linear model, we attempt to fit the training data to a linear function, sometimes called the hypothesis function. This is done through a process called linear regression. The hypothesis function for single-variable linear regression has the following form:

h(x) = θ0 + θ1*x

Here, θ0 and θ1 are the model parameters and x is the single independent variable. For our house price example, x could represent the size of floor space and h(x) could represent the predicted house price. For simplicity, we will begin by looking at just the single-variable, or single-feature, case.

In the following diagram, we show a number of points, representing training data, and an attempt to fit a straight line to these points. Here, x is the single feature, and θ0 and θ1 represent the intercept and the slope of the hypothesis function, respectively. Our aim is to find values for θ0 and θ1, the model parameters, which will give us our line of best fit in the preceding diagram.
In this diagram, θ0 is set to 1 and θ1 is set to 0.5. Therefore, the intercept is 1 and the line has a slope of 0.5. We can see that most of the training points lie above the line and a few lower-valued points lie below it. We could guess that θ1 is probably slightly too low, since the training points appear to have a slightly steeper slope. Also, θ0 is too high, since there are two data points below the line on the left and the intercept appears to be slightly lower than 1. It is clear that we need a formal approach to finding the error in the hypothesis function. This is done through what is known as a cost function. The cost function measures the total error between the values given by the hypothesis function and the actual values in the data. Essentially, the cost function sums each point's squared distance from the hypothesis. This cost function is sometimes called the mean squared error (MSE), and is expressed by the following equation:

J(θ) = (1 / 2m) * Σ (hθ(x^(i)) - y^(i))^2, summing over the m training samples

Here, hθ(x^(i)) is the value calculated by the hypothesis for the ith training sample, and y^(i) is its actual value. The difference is squared as a mathematical convenience, since it ensures the result is always positive. Squaring also adds more weight to larger differences; that is, it places greater importance on outliers. This sum is then divided by m, the number of training samples, to calculate the mean. The sum is also divided by two to make the subsequent math a little more straightforward. The final part is to adjust the parameter values so that the hypothesis function fits the training data as closely as possible. We need to find parameter values that minimize the error.
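The cost function can be computed directly from the hypothesis parameters and the training points. The following is a minimal sketch in plain Python, with made-up training data:

```python
def cost(theta0, theta1, xs, ys):
    """Half-mean-squared-error cost: J(theta) = (1/2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

# Hypothetical training points lying exactly on the line y = 1 + 0.5x
xs = [0.0, 2.0, 4.0]
ys = [1.0, 2.0, 3.0]
print(cost(1.0, 0.5, xs, ys))  # 0.0, a perfect fit
print(cost(0.0, 0.5, xs, ys))  # 0.5, a worse hypothesis gives a higher cost
```

Gradient descent, covered next, is simply a way of searching for the theta values that drive this number as low as possible.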
There are two ways we can do this:

Using gradient descent to iterate over the training set, adjusting the parameters to minimize the cost function
Directly computing the model parameters, using a closed-form equation

Gradient descent

Gradient descent is a general-purpose optimization algorithm that has a wide variety of applications. Gradient descent minimizes the cost function by iteratively adjusting the model parameters. It works by taking the partial derivative of the cost function with respect to each parameter. For linear regression, if we plot the cost function against a parameter value, it forms a convex, bowl-shaped curve. As we vary θ, the cost, J(θ), decreases to a minimum and then rises. The aim is that, on each iteration of gradient descent, the cost moves closer to the minimum, and then stops once it reaches this minimum. This is achieved using the following update rule:

θj := θj - α * ∂J(θ)/∂θj

Here, α is the learning rate, a settable hyperparameter. It is called a hyperparameter to distinguish it from the model parameters, theta. The partial derivative term is the slope of the cost function, and it needs to be calculated for both θ0 and θ1. You can see that when the derivative, and therefore the slope, is positive, a positive value is subtracted from the previous value of theta, moving theta toward the minimum from the right. Alternatively, if the slope is negative, theta increases, moving it toward the minimum from the left. Also, at the minimum, the slope is zero, so gradient descent will stop. This is exactly what we want, since no matter where we start gradient descent, the update rule moves theta toward the minimum, provided the learning rate is suitably chosen.
Substituting the cost function into the preceding equation and then taking the derivative with respect to each value of theta results in the following two update rules:

θ0 := θ0 - α * (1/m) * Σ (hθ(x^(i)) - y^(i))
θ1 := θ1 - α * (1/m) * Σ (hθ(x^(i)) - y^(i)) * x^(i)

On iterating these updates, theta will converge to values that minimize the cost function, resulting in the best-fit straight line to the training data. There are two things that need to be considered. Firstly, the initialization values of theta; that is, where we start gradient descent. In most cases, random initialization works best. The other thing we need to consider is the learning rate, alpha (α). This is typically a small number between zero and one. If the learning rate is set too high, then gradient descent will likely overshoot the minimum. If it is set too low, then it will take too long to converge. Finding a good value may take some experimentation with the particular model being used; in deep learning, an adaptive learning rate is often used for best results. This is where the learning rate changes, usually getting smaller, on each iteration of gradient descent.

The type of gradient descent we have discussed so far is called batch gradient descent (BGD). This refers to the fact that on each update, the entire training set is used. This means that as the training set gets large, batch gradient descent becomes increasingly slow. On the other hand, batch gradient descent scales much better than the closed-form solution when there are a large number of features, so it is most often used when there is a smaller training set with a large number of features. An alternative to batch gradient descent is stochastic gradient descent (SGD). Instead of calculating the gradient using the entire training set, SGD calculates the gradient using a single sample, chosen randomly, on each iteration. The advantage of SGD is that the entire training set does not have to reside in memory, since on each iteration it works with one instance only.
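The two update rules can be put together into a small batch gradient descent loop. The following is a sketch in plain Python with an invented one-feature dataset; the learning rate and iteration count are arbitrary choices:

```python
def batch_gradient_descent(xs, ys, alpha=0.1, epochs=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0  # initialization (random values also work)
    m = len(xs)
    for _ in range(epochs):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Each update uses the whole training set: this is what makes it "batch"
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Points on the line y = 1 + 0.5x; gradient descent should recover the parameters
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.5, 2.0, 2.5, 3.0]
t0, t1 = batch_gradient_descent(xs, ys)
print(round(t0, 3), round(t1, 3))  # approximately 1.0 0.5
```

Try raising alpha toward 0.3 and beyond with this data: the updates start to overshoot and eventually diverge, which is the instability described above.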
Because stochastic gradient descent chooses samples at random, its behavior is a little less regular than BGD's. With batch gradient descent, every iteration smoothly moves the error, J(θ), toward the minimum. With SGD, an individual iteration does not necessarily move the cost closer to the minimum. The cost tends to jump around a bit, moving toward the minimum only on average over a number of iterations. This means that it may jump around close to the minimum but never actually reach it by the time it completes its iterations. The random nature of SGD can be used to advantage when there is more than one minimum, since it may be able to jump out of a local minimum and find the global minimum. For example, consider a cost function with both a local minimum and a global minimum: if batch gradient descent were to begin on the far side of the local minimum, it would settle there and never find the global minimum. Fortunately, the cost function for linear regression is always a convex function with a single minimum. However, this is not always the case, particularly with neural networks, where the cost function can have a number of local minima.

Multiple features

In a realistic example, we would have more than one feature, and each feature has an associated parameter value that requires fitting. We write the hypothesis function for multiple features as follows:

hθ(x) = θ0*x0 + θ1*x1 + θ2*x2 + ... + θn*xn

Here, x0 is called the bias variable and is set to one, x1 to xn are the feature values, and n is the total number of features. Notice that we can write a vectorized version of the hypothesis function:

hθ(x) = θ^T x

Here, θ is the parameter vector and x is the feature vector. The cost function is still basically the same as in the single-feature case; we are just summing the error. We do, however, need to adjust the gradient descent rules and be clear about the required notation. In the update rules for gradient descent for a single feature, we used the notation θ0 and θ1 for the parameter values.
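A stochastic variant of the earlier gradient descent sketch can be written by swapping the full-batch gradient for the gradient of a single, randomly chosen sample. The dataset, learning rate, and iteration count here are invented for illustration:

```python
import random

def sgd(xs, ys, alpha=0.05, iterations=5000, seed=0):
    """Fit h(x) = theta0 + theta1 * x using one random sample per update."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        i = rng.randrange(len(xs))          # pick a single sample at random
        error = theta0 + theta1 * xs[i] - ys[i]
        theta0 -= alpha * error             # update from this one sample only
        theta1 -= alpha * error * xs[i]
    return theta0, theta1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.5, 2.0, 2.5, 3.0]
t0, t1 = sgd(xs, ys)
print(t0, t1)  # close to 1.0 and 0.5
```

Because this toy data lies exactly on a line, SGD settles on the same answer as batch gradient descent; with noisy data, the fixed learning rate would leave it jittering around the minimum instead.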
For the multiple-feature version, we simply wrap these parameter values and their associated features into vectors. An individual parameter is notated as θj, where the subscript j refers to the feature and is an integer between 0 and n, where n is the number of features. There needs to be a separate update rule for each parameter. We can generalize these rules as follows:

θj := θj - α * (1/m) * Σ (hθ(x^(i)) - y^(i)) * xj^(i)

There is an update rule for each parameter; so, for example, the update rule for the parameter of feature j = 1 would be the following:

θ1 := θ1 - α * (1/m) * Σ (hθ(x^(i)) - y^(i)) * x1^(i)

The variables x^(i) and y^(i) refer, as in the single-feature example, to the features and the actual value of the ith training sample, respectively. In the multiple-feature case, however, x^(i) is now a vector rather than a single value. The value xj^(i) refers to feature j of training sample i, and m is the total number of samples in the training set.

The normal equation

For some linear regression problems, the closed-form solution, also known as the normal equation, is a better way to find the optimum values of theta. If you know calculus, then to minimize the cost function, you can find the partial derivatives of the cost function with respect to each value of theta, set each derivative to zero, and then solve for each value of theta. Don't worry if you are not familiar with calculus; it turns out that we can derive the normal equation from these partial derivatives, and this results in the following equation:

θ = (X^T X)^-1 X^T y

You may wonder why we need to bother with gradient descent and the added complications it entails, since the normal equation allows us to compute the parameters in one step. The reason is that the computational effort required to invert a matrix is not insignificant. When the feature matrix X becomes large (and remember, X is a matrix holding all of the feature values for every training sample), finding the inverse of the associated matrix simply takes too long.
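The normal equation can be evaluated directly with NumPy. This sketch uses a tiny invented dataset; the first column of X is the bias feature, x0 = 1:

```python
import numpy as np

# Design matrix: a column of ones (the bias feature) plus one feature column
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 1.5, 2.0, 2.5])  # points on the line y = 1 + 0.5x

# theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # approximately [1.0, 0.5], with no learning rate and no iteration
```

Notice that no feature scaling or learning rate was needed; the price paid is the matrix inversion, whose cost grows quickly with the number of features.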
Even though gradient descent involves many iterations, it is still faster than the normal equation for large datasets. An advantage of the normal equation is that, unlike gradient descent, it does not require features to be of the same scale. Another advantage of the normal equation is that it is not necessary to choose a learning rate.

Logistic regression

We can use the linear regression model to perform binary classification by finding a decision boundary that divides the two predicted classes. A common way to do this is by using the sigmoid function, defined as the following:

g(z) = 1 / (1 + e^-z)

The plot of the sigmoid function is an S-shaped curve, rising from 0, through 0.5 at z = 0, and flattening out toward 1. The sigmoid function can be used in the hypothesis function to output a probability, as follows:

hθ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))

Here, the output of the hypothesis function is the probability that y = 1, given x, parameterized by theta. To decide when to predict y = 0 or y = 1, we can use the following two rules:

Predict y = 1 when hθ(x) >= 0.5; that is, when θ^T x >= 0
Predict y = 0 when hθ(x) < 0.5; that is, when θ^T x < 0

The characteristics of the sigmoid function (that is, having asymptotes at 0 and 1, and having a value of 0.5 at z = 0) give it some attractive properties for use with logistic regression problems. Notice that the decision boundary is a property of the parameters of the model, not of the training set. We still need to fit the parameters so that the cost, or the error, is minimized. To do this, we need to formalize what we know already. We have a training set with m samples, written as the following:

(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))

Each training sample consists of a vector x, of size n, where n is the number of features:

x = [x1; x2; ...; xn]

Each training sample also consists of a value y and, for logistic regression, this value is either zero or one. We also have a hypothesis function for logistic regression, which we can rewrite as the following equation:

hθ(x) = 1 / (1 + e^(-θ^T x))

If we use the same cost function as for linear regression with the logistic regression hypothesis, the sigmoid function introduces a nonlinearity.
This means that the cost function is no longer convex and, as a result, it may have a number of local minima, which can be a problem for gradient descent. It turns out that a cost function that works well for logistic regression, and is convex, is the following:

Cost(hθ(x), y) = -log(hθ(x))      if y = 1
Cost(hθ(x), y) = -log(1 - hθ(x))  if y = 0

If we plot these functions for the two cases, we can see that, when the label y equals 1 and the hypothesis predicts 0, the cost approaches infinity. Also, when the actual value of y is 0 and the hypothesis predicts 1, similarly, the cost rises toward infinity. Alternatively, when the hypothesis predicts the correct value, either 0 or 1, the cost falls to 0. This is exactly what we want for logistic regression. Now we need to apply gradient descent to minimize the cost. We can rewrite the logistic regression cost function for binary classification in a more compact form, summing over the m training samples, using the following equation:

J(θ) = -(1/m) * Σ [ y^(i) * log(hθ(x^(i))) + (1 - y^(i)) * log(1 - hθ(x^(i))) ]

Finally, we can update the parameter values with this update rule:

θj := θj - α * (1/m) * Σ (hθ(x^(i)) - y^(i)) * xj^(i)

Superficially, this looks identical to the update rule for linear regression; however, the hypothesis is a function of the sigmoid, so it actually behaves quite differently.

Nonlinear models

We have seen that linear models, by themselves, fail to represent nonlinear real-world data. A possible solution is to add polynomial features to the hypothesis function. For example, a cubic model can be represented by the following equation:

hθ(x) = θ0 + θ1*x + θ2*x^2 + θ3*x^3

Here, we have chosen two derived features to add to our model. These added terms could simply be the square and cube of the size feature in the housing example. An important consideration when adding polynomial terms is feature scaling. The squared and cubic terms in this model will be of quite different scales. In order for gradient descent to work correctly, it is necessary to scale these added polynomial terms.
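The compact logistic regression cost can be sketched in plain Python; the predictions and labels below are invented for illustration:

```python
import math

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_cost(predictions, labels):
    """J = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)), for h strictly in (0, 1)."""
    m = len(predictions)
    total = sum(y * math.log(h) + (1 - y) * math.log(1 - h)
                for h, y in zip(predictions, labels))
    return -total / m

print(sigmoid(0.0))  # 0.5, the decision boundary

# Confident, correct predictions give a low cost...
print(logistic_cost([0.9, 0.1, 0.8], [1, 0, 1]))
# ...while a confident but wrong prediction drives the cost up sharply
print(logistic_cost([0.9, 0.9, 0.8], [1, 0, 1]))
```

As the second label's prediction moves toward 1 while its true value is 0, the -log(1 - h) term grows without bound, which is the behavior described above.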
Choosing polynomial terms is a way to inject knowledge into a model. For example, simply knowing that house prices tend to flatten out relative to floor space as the floor space gets large suggests adding squared and cubic terms, giving us the shape that we expect the data to take. However, when we are trying to predict a complicated, multidimensional decision boundary, say in logistic regression, feature selection may mean adding thousands of polynomial terms. Under such circumstances, the machinery of linear regression grinds to a halt. We will see that neural networks offer a more automated and powerful solution to complicated nonlinear problems.

Artificial neural networks

As the name suggests, ANNs are inspired by their biological counterpart, although the nature of this inspiration is, perhaps, often misunderstood. An artificial neuron, or what we will call a unit, is grossly simplified compared to a biological neuron, both in terms of functionality and structure. The biological inspiration comes more from the insight that each neuron in a brain performs an essentially identical function, regardless of whether it is processing sound or vision, or pondering complex mathematical problems. This single-algorithm approach is, fundamentally, the inspiration for ANNs. An artificial neuron, a unit, performs a single simple function: it adds up its inputs and, depending on an activation function, gives an output. One of the major benefits of ANNs is that they are highly scalable. Since they are composed of fundamental units, simply adding more units in the right configuration allows ANNs to scale to massive, complex data. The theory of ANNs has been around for quite some time, first proposed in the early 1940s. However, it is only recently that they have been able to outperform more traditional machine learning techniques.
There are three broad reasons for this:

The improvement in algorithms, notably the implementation of backpropagation, allowing an ANN to distribute the error at the output back to the input layers and adjust the weights accordingly
The availability of massive datasets to train ANNs
The increase in processing power, allowing large-scale ANNs

The perceptron

One of the simplest ANN models is the perceptron, consisting of a single logistic unit. Each of the inputs is associated with a weight, and these weighted inputs are fed into the logistic unit. Note that we add a bias feature, x0 = 1. The logistic unit consists of two elements: a function to sum the inputs and an activation function. If we use the sigmoid as the activation function, then we can write the following equation:

h(x) = g(w^T x) = 1 / (1 + e^(-w^T x))

Note that this is exactly the hypothesis we used for logistic regression; we have simply swapped θ for w, to denote the weights in the logistic unit. These weights are exactly equivalent to the parameters of the logistic regression model. To create a neural network, we connect these logistic units into layers. Consider a three-layered neural network consisting of an input layer with three units; one hidden layer, also with three units; and, finally, a single unit in the output layer. For the sake of clarity, we omit the bias units from the description. We use the notation a_i^(j) to refer to the activation of unit i in layer j, and W^(j) to denote the matrix of weights that maps layer j to layer j+1. Using this notation, we can express the activations of the three hidden units with the following equations:

a_1^(2) = g(W_10^(1)*x0 + W_11^(1)*x1 + W_12^(1)*x2 + W_13^(1)*x3)
a_2^(2) = g(W_20^(1)*x0 + W_21^(1)*x1 + W_22^(1)*x2 + W_23^(1)*x3)
a_3^(2) = g(W_30^(1)*x0 + W_31^(1)*x1 + W_32^(1)*x2 + W_33^(1)*x3)

The activation of the output unit can then be expressed by the following equation:

hW(x) = a_1^(3) = g(W_10^(2)*a_0^(2) + W_11^(2)*a_1^(2) + W_12^(2)*a_2^(2) + W_13^(2)*a_3^(2))

Here, W^(1) is a 3 x 4 matrix controlling the function mapping between the input, layer one, and the single hidden layer, layer two.
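The forward pass of the three-layered network just described can be sketched with NumPy. The weight values here are arbitrary random numbers: W1 maps the four inputs (bias plus three features) to the three hidden units, and W2 maps the hidden layer (bias plus three activations) to the single output:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Forward pass: input layer -> hidden layer -> single output unit."""
    a1 = np.concatenate(([1.0], x))      # input activations with bias x0 = 1
    a2 = sigmoid(W1 @ a1)                # hidden activations, one per unit
    a2 = np.concatenate(([1.0], a2))     # add the hidden layer's bias unit
    return sigmoid(W2 @ a2)              # output activation, h(x)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))  # 3 x 4: maps layer one to the hidden layer
W2 = rng.normal(size=(1, 4))  # 1 x 4: maps the hidden layer to the output
h = forward(np.array([0.5, -1.0, 2.0]), W1, W2)
print(h)  # a single value between 0 and 1
```

Each layer is just a matrix multiplication followed by an element-wise sigmoid, which is why the matrix shapes described in the text matter: they must chain together.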
The weight matrix W^(2), of size 1 x 4, controls the mapping between the hidden layer and the output layer. More generally, for a network with s_j units in layer j and s_k units in layer j+1, the weight matrix mapping layer j to layer j+1 will have a size of s_k by s_j + 1. For example, for a network that has five input units and three units in the next forward layer, layer two, the associated weight matrix, W^(1), will be of size 3 x 6. Having established a hypothesis function, the next step is to formulate a cost function to measure, and ultimately minimize, the error of the model. For classification, the cost function is almost identical to that used for logistic regression. The important difference is that, with neural networks, we can add output units to allow multi-class classification. We can write the cost function for multiple outputs as follows:

J(W) = -(1/m) * Σ_i Σ_k [ y_k^(i) * log(hW(x^(i))_k) + (1 - y_k^(i)) * log(1 - hW(x^(i))_k) ]

Here, K is the number of output units, representing the number of output classes, and the inner sum runs over k = 1 to K. Finally, we need to minimize the cost function, which is done using the backpropagation algorithm. Essentially, what this does is backpropagate the error, the gradient of the cost function, from the output units to the input units. To do this, we need to evaluate partial derivatives. That is, we need to compute the following:

∂J(W) / ∂W_ij^(l)

Here, l is the layer, and i and j index the units that the weight connects. In other words, for each weight in each layer, and for every sample, we need to calculate the partial derivative, the gradient, of the cost function with respect to that parameter. For example, consider that we have a network with four layers. Consider also that we are working with a single sample. We need to find the error at each layer, beginning at the output. The error at the output is just the error of the hypothesis:

δ^(4) = a^(4) - y

This is a vector of the errors, one element for each output unit. The superscript (4) indicates that this is the fourth layer; that is, the output layer.
It turns out that, through some complicated math we do not need to worry about here, the errors for the two hidden layers can be calculated with the following equations:

δ^(3) = (W^(3))^T δ^(4) .* g'(z^(3))
δ^(2) = (W^(2))^T δ^(3) .* g'(z^(2))

The .* operator here is element-wise vector multiplication, and g' is the derivative of the activation function. Notice that the error vector of the next forward layer is required in each of these equations. That is, to calculate the error in layer three, the error vector of the output layer is required. Similarly, calculating the error in layer two requires the error vector of layer three.

This is how backpropagation works with a single sample. To loop across an entire dataset, we need to accumulate the gradients for each unit and each sample. So, for each sample in the training set, the neural net performs forward propagation to compute the activations for the hidden layers and the output layer. Then, for the same sample, that is, within the same loop, the output error can be calculated. Consequently, we are able to calculate the error for each previous layer in turn, and the neural net does exactly this, accumulating each gradient in a matrix. The loop then begins again, performing the identical set of operations on the next sample, and these gradients are also accumulated in the error matrix. We can write the accumulation rule as follows:

Δ_ij^(l) := Δ_ij^(l) + a_j^(l) * δ_i^(l+1)

The capital delta, Δ, is the matrix that stores the accumulated gradients: for each sample, we add the activation of unit j in layer l, multiplied by the associated error of unit i in the next forward layer, for that same sample. Finally, once we have made a pass over the entire training set (an epoch), we can calculate the derivative of the cost function with respect to each parameter:

∂J(W) / ∂W_ij^(l) = (1/m) * Δ_ij^(l)

Once again, it is not necessary to know the formal proof of this; the aim is just to give you some intuitive understanding of the mechanics of backpropagation.

Summary

We have covered a lot of material in this chapter.
Don't worry if you do not understand some of the mathematics presented here. The aim is to give you some intuition into how some common machine learning algorithms work, not a complete understanding of the theory behind these algorithms. After reading this chapter, you should have some understanding of the following:

General approaches to machine learning, including knowing the difference between supervised and unsupervised methods, online and batch learning, and rule-based, as opposed to model-based, learning
Some unsupervised methods and their applications, such as clustering and principal component analysis
Types of classification problems, such as binary, multi-class, and multiple output classification
Features and feature transformations
The mechanics of linear regression and gradient descent
An overview of neural networks and the backpropagation algorithm

In Chapter 3, Computational Graphs and Linear Models, we will apply some of these concepts using PyTorch. Specifically, we will show how to find the gradients of functions by building a simple linear model. You will also gain a practical understanding of backpropagation by implementing a simple neural network.

3 Computational Graphs and Linear Models

By now, you should have an understanding of the theory of linear models and neural networks, as well as a knowledge of the fundamentals of PyTorch. In this chapter, we will put all this together by implementing some ANNs in PyTorch. We will focus on the implementation of linear models, and show how they can be adapted to perform multi-class classification. We will discuss the following topics in relation to PyTorch:

autograd
Computational graphs
Linear regression
Logistic regression
Multi-class classification

autograd

As we saw in the last chapter, much of the computational work for ANNs involves calculating the derivatives needed to find the gradient of the cost function.
PyTorch uses the autograd package to perform automatic differentiation of operations on PyTorch tensors. To see how this works, let's look at an example:

In the preceding code, we create a 2 x 3 torch tensor and, importantly, set the requires_grad attribute to True. This enables the calculation of gradients across subsequent operations. Notice also that we set the dtype to torch.float, since this is the data type that PyTorch uses for automatic differentiation. We perform a sequence of operations and then take the mean of the result. This returns a tensor containing a single scalar. This is normally what autograd requires to calculate the gradient of the preceding operations. This could be any sequence of operations; the important point is that all these operations are recorded. The input tensor, a, is tracking these operations, even though there are two intermediate variables. To see how this works, let's write down the sequence of operations performed in the preceding code with respect to the input tensor a:

out = (1/6) Σ 2(a_i + 2)²

Here, the summation and division by six represent taking the mean across the six elements of the tensor a. For each element, a_i, the operations assigned to the tensor b (the addition of two) and c (squaring and multiplying by two) are summed and divided by six. Calling backward() on the out tensor calculates the derivative of the previous operations. This derivative can be written as follows, and if you know a little bit of calculus you will be able to easily confirm it:

∂out/∂a_i = 4(a_i + 2)/6 = 2(a_i + 2)/3

When we substitute the values of a into the right-hand side of the preceding equation, we do, indeed, get the values contained in the a.grad tensor, printed out in the preceding code.

It is sometimes necessary to perform operations that do not need to be tracked on tensors that have requires_grad=True. To save memory and computational effort, it is possible to wrap such operations in a with torch.no_grad(): block.
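The autograd example described above can be sketched as follows; the tensor values here are illustrative, not necessarily the book's, but the sequence of operations (add two, square and double, take the mean) matches the text:

```python
import torch

# Create a 2 x 3 tensor and ask autograd to track operations on it
a = torch.tensor([[1., 2., 3.], [4., 5., 6.]],
                 dtype=torch.float, requires_grad=True)

b = a + 2            # first tracked operation: the addition of two
c = 2 * b * b        # second tracked operation: squaring and multiplying by two
out = c.mean()       # reduce to a single scalar so backward() can be called

out.backward()       # computes d(out)/d(a) for every element of a

# Each gradient element is 4 * (a_i + 2) / 6 = 2 * (a_i + 2) / 3
print(a.grad)
```

Printing `a.grad` shows the derivative of `out` with respect to every element of `a`, which you can confirm against the formula above.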
For example, observe the following code:

To stop PyTorch tracking operations on a tensor, use the .detach() method. This prevents future tracking of operations and detaches the tensor from the tracking history:

Notice that if we try to calculate gradients a second time, by, for example, calling out.backward(), we will again generate an error. If we do need to calculate gradients a second time, we need to retain the computational graph. This is done by setting the retain_graph parameter to True. For example, observe the following code:

Notice that calling backward() a second time adds the gradients to the ones already stored in the a.grad variable. Note that the grad buffer is freed once backward() is called without setting the retain_graph parameter to True.

Computational graphs

To get a better understanding of this, let's look at what precisely a computational graph is. We can draw the graph for the function we have been using so far as follows:

Here, the leaves of the graph represent the inputs and parameters of each layer, and the output represents the loss. Typically, unless retain_graph is set to True, PyTorch will create a new computational graph on each iteration of an epoch.

Linear models

Linear models are an essential way to understand the mechanics of ANNs. Linear models are used both to predict a continuous variable (linear regression) and to predict a class (logistic regression). Neural networks are extremely useful for multi-class classification, since their architecture can be naturally adapted to multiple inputs and outputs.

Linear regression in PyTorch

Let's see how PyTorch implements a simple linear network. We could use autograd and backward to manually iterate through gradient descent. This unnecessarily low-level approach encumbers us with a lot of code that would be difficult to maintain, understand, and upgrade.
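The no_grad, detach, and retain_graph behaviors described above can be sketched together; the tensor values are our own:

```python
import torch

a = torch.ones(2, 3, requires_grad=True)

# Operations inside a no_grad() block are not tracked,
# saving memory and computational effort
with torch.no_grad():
    untracked = a * 2
print(untracked.requires_grad)   # not tracked

# detach() returns a tensor that shares data but is cut from the graph
d = a.detach()
print(d.requires_grad)           # not tracked

# To call backward() twice on the same graph, retain it on the first call
out = (2 * (a + 2) ** 2).mean()
out.backward(retain_graph=True)
first = a.grad.clone()
out.backward()                   # gradients accumulate into a.grad
print(a.grad)                    # now twice the first gradient
```

Without `retain_graph=True` on the first call, the second `backward()` would raise an error because the graph's buffers are freed.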
Fortunately, PyTorch has a very straightforward object-oriented approach to building ANNs, using classes to represent models. The model classes we customize inherit all the foundation machinery required for building ANNs from the superclass, torch.nn.Module. The following code demonstrates the standard way to implement a module (in this case, a LinearModel) in PyTorch:

The nn.Module is the base class and is called through the super function on initialization. This ensures that the model inherits all the functionality encapsulated in nn.Module. We set a variable, self.linear, to the nn.Linear class, reflecting the fact that we are building a linear model. Remember, a linear function with one independent variable, that is, one feature, x, can be written in the following way:

y = w1*x + w0

The nn.Linear class contains two learnable variables: bias and weight. In our single-feature model, these are the two parameters, w0 and w1, respectively. When we train the model, these variables are updated, ideally to values that approach the line of best fit to the data. Finally, in the preceding code, we instantiate the model by creating the variable, model, and setting it to our LinearModel class.

Before we can run the model, we need to set the learning rate, the type of optimizer to use, and the criterion used to measure the loss. This is done with the following code:

As you can see, we set the learning rate to 0.01. This tends to be a good starting point; any higher and the optimizer may overshoot the optimum, any lower and it may take too long to find it. We set the optimizer to stochastic gradient descent, passing it the items we need it to optimize (in this case, the model parameters), as well as the learning rate to use on each step of the gradient descent. Finally, we set the loss criterion, that is, the measure gradient descent uses to evaluate the loss, and here we set it to the mean square error.
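The model class and the optimizer setup described above can be sketched as follows. The class name, layer sizes, and hyperparameter values follow the text; the exact listing in the book is not reproduced, so details beyond that are assumptions:

```python
import torch
import torch.nn as nn

class LinearModel(nn.Module):
    def __init__(self):
        # call the nn.Module constructor to inherit its machinery
        super(LinearModel, self).__init__()
        # one input feature, one output: y = w1*x + w0
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

model = LinearModel()

criterion = nn.MSELoss()                                   # mean square error
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # stochastic gradient descent
```

Calling `model(x)` on an input tensor invokes `forward` and returns the current linear prediction.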
To test this linear model, we need to feed it some data and, for testing purposes, we create a simple dataset, x, consisting of the numbers from 1 to 10. We create the output, or target, data by applying a linear transformation to the input values. Here, we use the linear function y = 3*x + 5. This is coded as follows:

Note that we need to reshape these tensors so the input, x, and the target, y, have the same shape. Note also that we do not need to set autograd, as this is all handled by the model class. We do, however, need to tell PyTorch that the input tensor is of data type torch.float since, by default, it would treat the list as integers.

Now we are ready to run the linear model, and to do this we run it in a loop for each epoch. This training cycle consists of the following three steps:

1. A forward pass over the training set
2. A backward pass to compute the loss
3. Updating the parameters according to the gradient of the loss function

This is done with the following code:

We set epochs to 1000. Remember, each epoch is one full pass over the training set. The model inputs are set to the x values of the dataset; in this case, simply the sequence of numbers from 1 to 10. We set the labels to the y values; in this case, the values calculated by our function, 3*x + 5.

Importantly, we need to clear the gradients so that they do not accumulate over epochs and distort the model. This is achieved by calling the zero_grad() function on the optimizer on each epoch. The out tensor is set to the linear model's output, calling the forward function of the LinearModel class. This model applies a linear function, with the current estimate of the parameters, and gives a predicted output. Once we have an output, we can calculate the loss using the mean square error, comparing the actual y values to the values calculated by the model.
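The dataset and the three-step training cycle described above can be sketched as follows; the data and hyperparameters follow the text, while the loop body itself is our reconstruction:

```python
import torch
import torch.nn as nn

# The simple dataset: x is 1..10, y is generated by y = 3x + 5
x = torch.tensor(range(1, 11), dtype=torch.float).reshape(-1, 1)
y = 3 * x + 5

model = nn.Linear(1, 1)   # stands in for the LinearModel class defined earlier
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(1000):
    optimizer.zero_grad()        # clear accumulated gradients
    out = model(x)               # 1. forward pass over the training set
    loss = criterion(out, y)     # compute the mean square error
    loss.backward()              # 2. backward pass to compute gradients
    optimizer.step()             # 3. update the parameters
```

After 1,000 epochs, the learned weight and bias should be close to 3 and 5, respectively, matching the function used to generate the targets.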
Next, the gradient can be calculated by calling backward() on the loss function. This determines the next step of the gradient descent, enabling the step() function to update the parameter values. We also create a predicted variable that will store the model's predicted values. We will use this shortly when we plot the predictions against the actual values.

To understand whether our model is working, we print the loss on each epoch. Notice the loss is decreasing each time, indicating that it is working as expected. Indeed, by the time the model completes 1,000 epochs, the loss is quite small. We can print the model's state (that is, the parameter values) by running the following code:

Here, the linear.weight tensor consists of the single element of value 3.0113 and the linear.bias tensor contains the value 4.9210. These are very close to the values of w1 (3) and w0 (5) that we used to create the linear dataset through the y = 3x + 5 function.

To make this a little more interesting, let's see what happens when, instead of using a linear function to create the labels, we add a squared term to the function (for example, y = 3x² + 5). We can visualize the result of the model by graphing the predicted values against the actual values. We can see the result with the following code:

We have used the y = 3x² + 5 function to generate the labels. The squared term gives the training set its characteristic curve, and the linear model's predictions are the best-fit straight line. You can see that after 1,000 epochs, this model does a reasonably good job of fitting the curve.

Saving models

Once a model has been built and trained, it is common to want to save the model's state. This is not so important in cases like this, when training takes an insignificant amount of time. However, with large datasets and many parameters, training can potentially take hours or even days to complete.
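Inspecting and saving the model's state, as described above, can be sketched as follows; the filename is our own choice, and a plain nn.Linear stands in for the trained LinearModel:

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(1, 1)          # stands in for the trained LinearModel

# Print the model's state: its learnable parameters ('weight' and 'bias')
print(model.state_dict())

# Serialize the parameters to disk (torch.save uses pickle underneath)
path = os.path.join(tempfile.gettempdir(), 'linear_model.pth')
torch.save(model.state_dict(), path)

# Restoring requires a model of the same shape in memory, since only
# the parameters were saved, not the entire model
restored = nn.Linear(1, 1)
restored.load_state_dict(torch.load(path))
```

After `load_state_dict`, the restored model's weight and bias match the saved ones exactly.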
Clearly, we do not want to retrain a model every time we need it to make a prediction on new data. To save a trained model's parameters, we simply run the following code:

The preceding code saves the model using Python's inbuilt object serialization module, pickle. When we need to restore the model, we can do the following:

Note that we need our LinearModel class in memory for this to work, since we are only saving the model's state, that is, the model parameters, not the entire model. To retrain the model once we have restored it, we need to reload the data and set the model's hyperparameters (in this case, the optimizer, learning rate, and criterion).

Logistic regression

A simple logistic regression model does not look a great deal different from the model for linear regression. The following is a typical class definition for a logistic model:

Notice that we still use a linear function when we initialize the model class. However, for logistic regression, we need an activation function. Here, this is applied when forward is called. As usual, we instantiate the model into our model variable. Next, we set the criterion and optimizer:

We still use stochastic gradient descent; however, we need to change the criterion for the loss function. With linear regression, we used the MSELoss function to calculate the mean square error. For logistic regression, we are working with probabilities, represented by values between zero and one. It does not make much sense to calculate the mean square error of a probability; instead, a common technique is to use the cross-entropy loss, or log loss. Here, we use the BCELoss function, or binary cross-entropy loss. The theory behind this is a little involved. What is important to understand is that it is essentially a logarithmic function that better captures the notion of a probability.
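The logistic model class and the BCELoss setup described above can be sketched as follows; the class name follows the text, and we apply the sigmoid in `forward` as described:

```python
import torch
import torch.nn as nn

class LogisticModel(nn.Module):
    def __init__(self):
        super(LogisticModel, self).__init__()
        # still a linear function underneath
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        # the activation is applied when forward is called,
        # squashing the output into a probability in (0, 1)
        return torch.sigmoid(self.linear(x))

model = LogisticModel()

criterion = nn.BCELoss()                                   # binary cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # still SGD
```

The only structural change from the linear model is the sigmoid in `forward` and the swap from `MSELoss` to `BCELoss`.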
Because it is logarithmic, the log loss decreases slowly toward zero as a predicted probability approaches one, given a correct prediction. Remember, we are trying to calculate a penalty for an incorrect prediction. The loss must increase as the prediction diverges from the true value. Cross-entropy loss heavily penalizes predictions that are made with high confidence (that is, close to one) but are incorrect and, conversely, is forgiving of predictions that are made with lower confidence but are correct. We can train the model with the identical code used for linear regression, running each epoch in a for loop where we do a forward pass to calculate an output, a backward pass to calculate the loss gradient, and, finally, update the parameters.

Let's make this a little more concrete by creating a practice example. Suppose we are trying to categorize the species of an insect by some numerical measure, say the length of its wings. We have some training data as follows:

Here, the x_train values could represent the wing length in millimeters, and the y_train values each sample's label; a one indicates that the sample belongs to the target species. Once we have instantiated the LogisticModel class, we can run it using the standard running code. Once we have trained the model, we can test it using some new data:

Activation functions in PyTorch

Part of the trick that makes ANNs perform as well as they do is the use of nonlinear activation functions. A first thought is simply to use a step function. In this case, an output from a particular unit occurs only when the input exceeds zero. The problem with the step function is that it cannot be differentiated, since it does not have a defined gradient. It consists only of flat sections and is discontinuous at zero. Another option is to use a linear activation function; however, this restricts our output to a linear function as well.
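The insect example above can be sketched as follows. The wing-length values and labels here are made up, since the book's actual data is not reproduced, and an nn.Sequential stands in for the LogisticModel class:

```python
import torch
import torch.nn as nn

# Made-up training data: wing lengths in millimeters and 0/1 species labels
x_train = torch.tensor([[10.], [12.], [15.], [18.], [20.], [24.]])
y_train = torch.tensor([[0.], [0.], [0.], [1.], [1.], [1.]])

model = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid())  # stands in for LogisticModel
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

losses = []
for epoch in range(1000):
    optimizer.zero_grad()
    loss = criterion(model(x_train), y_train)   # forward pass and loss
    loss.backward()                             # backward pass
    optimizer.step()                            # update parameters
    losses.append(loss.item())

# Test on new data: the probability that a 22 mm sample
# belongs to the target species
print(model(torch.tensor([[22.]])).item())
```

The final probability is the model's confidence that the new sample belongs to the target species.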
This is not what we want, since we need to model highly nonlinear real-world data. It turns out that we can inject nonlinearity into our networks by using nonlinear activation functions. The following is a plot of some popular activation functions:

The ReLU, or rectified linear unit, is generally considered the most popular activation function. Even though it is non-differentiable at zero, where its characteristic elbow can make gradient descent jump around, in practice it works very well. One of the advantages of the ReLU function is that it is very fast to compute. Also, it does not have a maximum value; it continues to rise to infinity as its input rises. This can be advantageous in certain situations.

We have already met the sigmoid function; its major advantage is that it is differentiable at all input values. This can help in situations where the ReLU function causes erratic behavior during gradient descent. The sigmoid function, unlike ReLU, is constrained by asymptotes. This can also be beneficial for some ANNs.

The softmax function is typically used on output layers for multi-class classification. Remember, multi-class classification, in contrast with multi-label classification, has only one true output. In such cases, we need the predicted target to be as close to one as possible and all other outputs close to zero. The softmax function is a nonlinear form of normalization. We need to normalize the output to ensure we are approximating the probability distribution of the input data. Rather than using linear normalization, by simply dividing all outputs by their sum, softmax applies a nonlinear exponential function that increases the impact of outlying data points. This tends to increase a network's sensitivity by increasing its reaction to low stimuli.
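The behavior of these activation functions can be illustrated numerically; the input values below are our own:

```python
import torch

x = torch.tensor([-2.0, 0.0, 2.0])

# ReLU: negatives clipped to zero, unbounded above
print(torch.relu(x))

# sigmoid: squashed into (0, 1), differentiable everywhere
print(torch.sigmoid(x))

# tanh: squashed into (-1, 1); negative inputs stay negative
print(torch.tanh(x))

# softmax: normalizes a vector into a probability distribution summing to 1,
# with the exponential exaggerating the largest entries
print(torch.softmax(x, dim=0))
```

Note how the softmax outputs always sum to one, which is what makes it suitable for approximating a probability distribution over classes.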
It is computationally more complex than other activation functions; however, it turns out to be an effective generalization of the sigmoid function for multi-class classification.

The tanh activation function, or hyperbolic tangent function, is primarily used for binary classification. It has asymptotes at -1 and 1 and is often used as an alternative to the sigmoid function in situations where strongly negative input values cause the sigmoid to output values very close to zero, making gradient descent get stuck. The tanh function will output negative values in such situations, allowing the calculation of meaningful parameters.

Multi-class classification example

So far, we have been using trivial examples to demonstrate core concepts in PyTorch. We are now ready to explore a more real-world example. The dataset we will be using is the MNIST dataset of handwritten digits from 0 to 9. The task is to correctly identify each sample image with the correct digit.

The classification model we will be building consists of several layers, and these are outlined in the following diagram:

The images we are working with are 28 x 28 pixels in size, and each pixel in each image is characterized by a single number indicating its grayscale value. This is why we need 28 x 28, or 784, inputs to the model. The first layer is a linear layer with 10 outputs, one output for each label. These outputs are fed into the softmax activation layer and the cross-entropy loss layer. The 10 output dimensions represent the 10 possible classes, the digits zero to nine. The output with the highest value indicates the predicted label of a given image.

We begin by importing the required libraries, as well as the MNIST dataset:

Now let's print out some information about the MNIST dataset:

The len function returns the number of separate items (in this case, single images) in the dataset.
Each one of these images is encoded as a tensor, and the size of each image is 28 x 28 pixels. Each pixel in the image is assigned a single number indicating its grayscale value. To define our multi-class classification model, we are going to use exactly the same model definition that we used for linear regression:

Even though, ultimately, we need to perform logistic regression, we achieve the required activation and nonlinearity in a slightly different way to the binary case. You will notice that, in the model definition, the output returned by the forward function is simply a linear function. Instead of using the sigmoid function, as we did in the previous binary classification example, here we use the softmax function, which is included in the loss criterion. The following code sets up these variables and instantiates the model:

The CrossEntropyLoss() function essentially adds two layers to the network: a softmax activation function and a cross-entropy loss function. Each input to the network takes one pixel of the image, so our input dimension is 28 x 28 = 784. The optimizer uses stochastic gradient descent and a learning rate of 0.0001.

Next, we set a batch size and the number of epochs to run the model, and create a data loader object so the model can iterate over the data:

Setting a batch size feeds the data into the model in specific-sized chunks. Here, we feed the model batches of 100 images. The number of iterations (that is, the total number of forward-backward traversals of the network) can be calculated by dividing the length of the dataset by the batch size and multiplying this by the number of epochs. In this example, we have 5 x 60,000 / 100 = 3,000 iterations in total. It turns out this is a much more efficient and effective way to work with moderate to large datasets since, with finite memory, loading the entire dataset may not be possible.
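The model, loss, optimizer, and data loader setup described above can be sketched as follows. To keep the example self-contained, random tensors shaped like MNIST stand in for the real dataset; the class name is our own, while the sizes and hyperparameters follow the text:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class MulticlassModel(nn.Module):
    def __init__(self):
        super(MulticlassModel, self).__init__()
        # 784 pixel inputs, 10 class outputs (the digits 0-9)
        self.linear = nn.Linear(28 * 28, 10)

    def forward(self, x):
        # raw scores only; the softmax lives inside the loss criterion
        return self.linear(x)

model = MulticlassModel()
criterion = nn.CrossEntropyLoss()   # adds softmax + cross-entropy layers
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)

# Random stand-in for the 60,000 MNIST training images
fake_images = torch.rand(600, 28 * 28)
fake_labels = torch.randint(0, 10, (600,))
trainloader = DataLoader(TensorDataset(fake_images, fake_labels),
                         batch_size=100, shuffle=True)
```

Each iteration of `trainloader` then yields a batch of 100 flattened images and their labels.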
Also, the model tends to make better predictions, since it is trained on a different subset of the data with each batch. Setting shuffle to True shuffles the data on each epoch.

To run this model, we need to create an outer loop that loops through the epochs and an inner loop that loops through each batch. This is achieved with the following code:

This is similar to the code we have used to run all our models so far. The only difference here is that the model enumerates over each batch in trainloader, rather than iterating over the entire dataset at once. Here, we print out the loss on each epoch and, as expected, this loss is decreasing.

We can make a prediction using the model by making a forward pass:

The size of the predicted variable is 100 by 10. This represents the predictions for the 100 images in the batch. For each image, the model outputs a 10-element prediction tensor containing a value representing the relative strength of each label at each of its 10 outputs. The following code prints out the first prediction tensor and the actual label:

If we look closely at the previous output, we see that the model correctly predicted the label, since the second element, representing the digit 1, contains the highest value, 1.3957. We can see the relative strength of this prediction by comparing it to the other values in the tensor. For example, we can see that the next strongest prediction was for the number 7, with a value of 0.9142.

You will see that the model is not correct for every image, and to begin to evaluate and improve our models, we need to be able to measure their performance. The most straightforward way is to measure the success rate, that is, the proportion of correct results. To do this, we create the following function:

Here, we use list comprehensions, firstly to create a list of predictions by finding the maximum of each output.
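The nested training loop described above can be sketched as follows, again on random stand-in data rather than the real MNIST tensors:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(28 * 28, 10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)

images = torch.rand(600, 28 * 28)            # stand-in for the MNIST images
labels = torch.randint(0, 10, (600,))
trainloader = DataLoader(TensorDataset(images, labels),
                         batch_size=100, shuffle=True)

for epoch in range(5):                                   # outer loop: epochs
    for i, (batch_x, batch_y) in enumerate(trainloader): # inner loop: batches
        optimizer.zero_grad()
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
    print('epoch', epoch, 'loss', loss.item())

# A forward pass on one batch yields a 100 x 10 prediction tensor
predicted = model(batch_x)
print(predicted.size())
```

Each row of `predicted` holds the relative strength of the 10 labels for one image; the index of the largest value is the predicted digit.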
Next, we create a list of labels to compare with the predictions. We create a list of correct values by comparing each element in the predict list with the corresponding element in the actual list. Finally, we return the success rate by dividing the number of correct values by the total number of predictions made. We can calculate the success rate of our model by calling this function with the output predictions and the labels:

Here, we get a success rate of 83%. Note that this is calculated using images the model has already trained on. To truly test the model's performance, we need to test it on images it has not seen before. We do this with the following code:

Here, we have tested the model using all 10,000 images in the MNIST test set. We create an iterator from the data loader object and then load the images into the two tensors, images and labels. Next, we get an output (here, a 10,000 by 10 prediction tensor) by passing the model the test images. Finally, we run the SuccessRate function with the output and labels. The value is only slightly lower than the success rate on the training set, so we can be reasonably confident that this is an accurate measure of the model's performance.

Summary

In this chapter, we have explored linear models and applied them to the tasks of linear regression, logistic regression, and multi-class classification. We have seen how autograd calculates gradients and how PyTorch works with computational graphs. The multi-class classification model we built did a reasonable job of predicting handwritten digits; however, its performance is far from optimal. The best deep learning models are able to get near 100% accuracy on this dataset. We will see in Chapter 4, Convolutional Networks, how adding more layers and using convolutional networks can improve performance.
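A success-rate function along the lines described above can be sketched as follows; the book's exact listing is not reproduced, so the name and details here are our own:

```python
import torch

def success_rate(predicted, actual):
    # the index of the strongest output for each sample is the predicted label
    predictions = [p.argmax().item() for p in predicted]
    labels = [a.item() for a in actual]
    # count the predictions that match their labels
    correct = [1 for p, a in zip(predictions, labels) if p == a]
    return len(correct) / len(predictions)

# Tiny worked example: two of the three predictions match their labels
outputs = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = torch.tensor([1, 0, 0])
print(success_rate(outputs, labels))
```

On a real run, you would call this with the model's output tensor and the label tensor, exactly as the text describes.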
4
Convolutional Networks

Previously, we built several simple networks to solve regression and classification problems. These illustrated the basic code structure and concepts involved in building ANNs with PyTorch. In this chapter, we will extend these simple linear models by adding layers and using convolutional layers to solve nonlinear problems found in real-world examples. Specifically, we will cover the following topics:

Hyper-parameters and multilayered networks
Building a simple benchmarking function to train and test models
Convolutional networks

Hyper-parameters and multilayered networks

Now that you understand the process of building, training, and testing models, you will see that expanding these simple networks to increase performance is relatively straightforward. You will find that nearly all the models we build consist, essentially, of the following six steps:

1. Import data and create iterable data-loader objects for the training and test sets
2. Build and instantiate a model class
3. Instantiate a loss class
4. Instantiate an optimizer class
5. Train the model
6. Test the model

Of course, once we complete these steps, we will want to improve our models by adjusting a set of hyper-parameters and repeating the steps. It should be mentioned that, although we generally consider hyper-parameters to be things that are specifically set by a human, the setting of these hyper-parameters can be partially automated, as we shall see in the case of the learning rate. Here are the most common hyper-parameters:

The learning rate of gradient descent
The number of epochs to run the model
The type of nonlinear activation
The depth of the network, that is, the number of hidden layers
The width of the network, that is, the number of neurons in each layer
The connectivity of the network (for example, convolutional networks)

We have already worked with some of these hyper-parameters.
We know that the learning rate, if set too small, will take more time than necessary to find the optimum and, if set too large, will overshoot and behave erratically. The number of epochs is the number of complete passes over the training set. We would expect that, as we increase the number of epochs, the accuracy will improve on each epoch, given the limitations of the dataset and the algorithm used. At some point, the accuracy will plateau, and training over more epochs is a waste of resources. If the accuracy decreases over the first few epochs, one of the most likely causes is that the learning rate is set too high.

Activation functions play a critical role in classification tasks, and the effect of different types of activation can be somewhat subtle. It is generally agreed that the ReLU, or rectified linear, function performs best on the most common practice datasets. This is not to say that other activation functions, particularly the hyperbolic tangent, or tanh, function and variations such as leaky ReLU, cannot produce better results under certain conditions.

As we increase the depth, or number of layers, we increase the learning power of the network, enabling it to capture more complex features of the training set. Obviously, this increased ability is very much dependent on the size and complexity of the dataset and the task. With small datasets and relatively simple tasks, such as digit classification with MNIST, a very small number of layers (one or two) can give excellent results. Too many layers waste resources and tend to make the network overfit or behave erratically.

Much of this is also true when we come to increasing the width, that is, the number of units in each layer. Increasing the width of a linear network is one of the most efficient ways to boost learning power.
When it comes to convolutional networks, as we will see, not every unit is connected to every unit in the next forward layer; connectivity, that is, the number of input and output channels in each layer, is critical. We will look at convolutional networks shortly, but first we need to develop a framework to test and evaluate our models.

Benchmarking models

Benchmarking and evaluation are core to the success of any deep learning exploration. We will develop some simple code to evaluate two key performance measures: the accuracy and the training time. We will use the following model template:

This model is the most common and basic linear template for solving MNIST. You can see we initialize each layer in the __init__ method by creating a class variable that is assigned to a PyTorch nn object. Here, we initialize two linear functions and a ReLU function. The nn.Linear function takes an input size of 28*28, or 784. This is the size of each of the training images. The output channels, or the width of the network, are set to 100. This can be set to anything and, in general, a higher number will give better performance, within the constraints of computing resources and the tendency of wider networks to overfit the training data.

In the forward method, we create an out variable. You can see that the out variable is passed through an ordered sequence consisting of a linear function, a ReLU function, and another linear function before being returned. This is a fairly typical network architecture, consisting of alternating linear and nonlinear layers. Let's now create two more models, replacing the ReLU function with the tanh and sigmoid activation functions. Here is the tanh version:

You can see we simply changed the name and replaced the nn.ReLU() function with nn.Tanh(). Create a third model in exactly the same way, replacing nn.Tanh() with nn.Sigmoid().
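The model template described above, and its tanh variant, can be sketched as follows. The sizes (784 inputs, width 100, 10 outputs) follow the text; the class names are our own:

```python
import torch
import torch.nn as nn

class ModelRelu(nn.Module):
    def __init__(self):
        super(ModelRelu, self).__init__()
        self.linear1 = nn.Linear(28 * 28, 100)  # 784 pixel inputs, width 100
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(100, 10)       # 10 class outputs

    def forward(self, x):
        out = self.linear1(x)
        out = self.relu(out)
        out = self.linear2(out)
        return out

# The tanh variant simply swaps the nonlinearity
class ModelTanh(nn.Module):
    def __init__(self):
        super(ModelTanh, self).__init__()
        self.linear1 = nn.Linear(28 * 28, 100)
        self.tanh = nn.Tanh()
        self.linear2 = nn.Linear(100, 10)

    def forward(self, x):
        out = self.linear1(x)
        out = self.tanh(out)
        out = self.linear2(out)
        return out

print(ModelRelu()(torch.rand(4, 28 * 28)).shape)
```

A sigmoid version follows the same pattern with `nn.Sigmoid()` in place of the nonlinearity.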
Don't forget to change the name in the super constructor and in the variable used to instantiate the model. Also remember to change the forward function accordingly. Now, let's create a simple benchmark function that we can use to run and record the accuracy and training time of each of these models:

Hopefully, this is fairly self-explanatory. The benchmark function takes two required parameters: the data and the model to be evaluated. We set default values for the epochs and the learning rate. We need to initialize the model inside the function so we can run it more than once on the same model; otherwise, the model parameters would accumulate and distort our results. The running code is identical to the code used for the previous models. Finally, we print out the accuracy and the time taken to train.

The training time calculated here is really only an approximate measure, since training time will be affected by whatever else is going on in the processor, the amount of memory, and other factors beyond our control. We should only use this result as a relative indication of a model's time performance. Finally, we need a function to calculate the accuracy, and this is defined as follows:

Remember to load the training and test datasets and make them iterable, exactly as we did before. Now, we can run our three models and compare them using something like the following:

We can see that both the Tanh and ReLU functions perform significantly better than sigmoid. For most networks, the ReLU activation function on the hidden layers gives the best results, both in terms of accuracy and the time it takes to train. The ReLU activation is not used on output layers. For the output layers, since we need to calculate the loss, we use the softmax function. This is the criterion for the loss class and, as before, we use CrossEntropyLoss(), which, if you remember, includes the softmax function.
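A benchmark function along the lines described above can be sketched as follows. The book's exact signature and listing are not reproduced, so the parameter names, defaults, and the stand-in data are assumptions:

```python
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def accuracy(outputs, labels):
    # fraction of samples whose strongest output matches the label
    return (outputs.argmax(dim=1) == labels).float().mean().item()

def benchmark(trainloader, model_class, epochs=2, lr=0.1):
    model = model_class()   # fresh model so repeated runs don't accumulate
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    start = time.time()
    for epoch in range(epochs):
        for batch_x, batch_y in trainloader:
            optimizer.zero_grad()
            loss = criterion(model(batch_x), batch_y)
            loss.backward()
            optimizer.step()
    elapsed = time.time() - start   # approximate, relative measure only
    batch_x, batch_y = next(iter(trainloader))
    acc = accuracy(model(batch_x), batch_y)   # stand-in for test-set accuracy
    print('accuracy: %.3f, training time: %.2fs' % (acc, elapsed))
    return acc, elapsed

# Usage on random stand-in data
data = DataLoader(TensorDataset(torch.rand(200, 784),
                                torch.randint(0, 10, (200,))),
                  batch_size=100)
benchmark(data, lambda: nn.Sequential(nn.Linear(784, 100),
                                      nn.ReLU(), nn.Linear(100, 10)))
```

Passing a constructor rather than an instance is what lets the function reinitialize the parameters on every call.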
There are several ways we can improve from here; one obvious way is simply to add more layers. This is typically done by adding alternating pairs of nonlinear and linear layers. In the following, we use nn.Sequential to organize our layers. In our forward method, we then simply have to call the sequential object, rather than every individual layer and function. This makes our code more compact and easier to read:

Here, we add two more layers: a linear layer and a nonlinear ReLU layer. It is particularly important how we set the input and output sizes. In the first linear layer, the input size is 784; this is the image size. The output of this layer, something we choose, is set to 100. The input to the second linear layer, therefore, must be 100. This is the width, the number of kernels and feature maps, of the output. The output of the second linear layer is also something we choose, but the general idea is to decrease the size, since we are trying to filter the features down to just 10, our target classes. For fun, create some models and try out different input and output sizes, remembering that the input to any layer must be the same size as the output of the previous layer. The following is the output of three models, where we print the output sizes of each of the hidden layers to give you an idea of what is possible:

We can continue to add as many layers and kernels as we desire; however, this is not always a good idea. How we set up the input and output sizes in a network is intimately connected to the size, shape, and complexity of the data. For simple datasets, such as MNIST, it is pretty clear that a few linear layers get very good results. At some point, simply adding linear layers and increasing the number of kernels will not capture the highly nonlinear features of complex datasets.
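The deeper nn.Sequential model described above can be sketched as follows; the intermediate width of 50 is our own choice, following the general idea of filtering down toward the 10 target classes:

```python
import torch
import torch.nn as nn

class DeepModel(nn.Module):
    def __init__(self):
        super(DeepModel, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(28 * 28, 100),  # input must match the 784-pixel images
            nn.ReLU(),
            nn.Linear(100, 50),       # each input matches the previous output
            nn.ReLU(),
            nn.Linear(50, 10),        # filter down to the 10 target classes
        )

    def forward(self, x):
        # one call to the sequential object instead of every layer in turn
        return self.layers(x)

print(DeepModel()(torch.rand(2, 28 * 28)).shape)
```

If any layer's input size does not match the previous layer's output size, PyTorch raises a shape error at the first forward pass.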
Convolutional networks

So far, we have used fully connected layers in our networks, where each input unit represents a pixel in an image. With convolutional networks, on the other hand, each input unit is assigned a small, localized receptive field. The idea of the receptive field, like that of ANNs themselves, is modelled on the human brain. In 1958, it was discovered that neurons in the visual cortex of the brain respond to stimuli in a limited region of the field of vision. More intriguingly, sets of neurons respond exclusively to certain basic shapes. For example, one set of neurons may respond to horizontal lines, while others respond only to lines at other orientations. It was observed that sets of neurons could have the same receptive field but respond to different shapes. It was also noticed that neurons are organized into layers, with deeper layers responding to more complex patterns. This, it turns out, is a remarkably effective way for a computer to learn and categorize a set of images.

A single convolutional layer

Convolutional layers are organized so that the units in the first layer only respond to their respective receptive fields. Each unit in the second layer is connected only to a small region of the first layer, each unit in the third layer is connected to a limited region of the second, and so on. In this way, the network can be trained to assemble higher-level features from the low-level features present in the previous layer. In practice, this works by using a filter, or convolution kernel, to scan an image to generate what is known as a feature map. The kernel is just a matrix that is the size of the receptive field. We can think of this as a camera scanning an image in discrete strides. We calculate each entry in the feature map matrix by an element-wise multiplication of the kernel matrix with the values in the receptive field of the image.
The resultant matrix is then summed to compute a single number in the feature map. The values in the kernel matrix represent a feature we want to extract from the image. These are the parameters that we ultimately want the model to learn.

Consider a simple example where we are attempting to detect horizontal and vertical lines in an image. To simplify things, we will use one input dimension; each pixel is either black, represented by a 1, or white, represented by a 0. Remember that in practice these would be scaled and normalized floating-point numbers representing a grayscale or color value. Here, we set the kernel to 4 x 4 pixels and we scan using a stride of 1. A stride is simply the distance we move the kernel, so a stride of 1 moves the kernel one pixel.

One convolution is one complete scan of the image, and each convolution generates a feature map. On each stride, we do an element-wise multiplication of the kernel with the receptive field of the image and sum the resulting matrix. You will notice that as we move the kernel across the image, as shown in the preceding diagram, stride 1 samples the top-left corner, stride 2 samples the patch one pixel to the right, stride 3 samples one pixel further to the right again, and so on. When we reach the end of a row, we need to add padding pixels, set to a value of 0, in order to sample the edges of the image. Padding the input data with zeros in this way is called zero padding (note that the term "valid" padding actually refers to using no padding at all). If we did not pad the image, the feature map would be smaller, in dimensions, than the original image. Padding is used to ensure that there is no loss of information from the original.

It is important to understand the relationship between input and output sizes, kernel size, padding, and stride. They can be expressed quite neatly in the following formula:

O = (W - K + 2P) / S + 1

Here, O = output size, W = input height or width, K = kernel size, P = padding, and S = stride.
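The output-size formula can be turned into a tiny helper, which we will reuse when sizing layers later in the chapter (the function name is our own):

```python
def conv_output_size(W, K, P=0, S=1):
    """O = (W - K + 2P) / S + 1, for a square input of width W."""
    return (W - K + 2 * P) // S + 1

# A 5x5 kernel with padding 2 and stride 1 preserves a 28x28 input:
print(conv_output_size(28, 5, P=2, S=1))   # 28
# A 2x2 kernel with stride 2 and no padding halves it:
print(conv_output_size(28, 2, P=0, S=2))   # 14
```

Plugging in a kernel size, padding, and stride before writing any model code is a quick sanity check that the sizes of adjacent layers will line up.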
Note that using the input height or width assumes these two are the same, that is, that the input image is square, not rectangular. If the input image is a rectangle, we need to calculate the output values separately for the width and height. Rearranging the same formula, the padding required for a given output size can be calculated as follows:

P = ((O - 1)S - W + K) / 2

Multiple kernels

In each convolution, we can include multiple kernels. Each kernel in a convolution generates its own feature map. The number of kernels is the number of output channels, which is also the number of feature maps generated by the convolutional layer. We can generate further feature maps by using another kernel. As an exercise, calculate the feature map that would be generated by the following kernel:

By stacking kernels, or filters, and using kernels of different sizes and values, we can extract a variety of features from an image. Also, remember that each kernel is not restricted to one input dimension. For example, if we are processing an RGB color image, each kernel would have an input depth of three. Since we are doing element-wise multiplication, the kernel must be the same size as the receptive field; when the input has three channels, the kernel needs to have an input depth of three. So our grayscale 2 x 2 kernel becomes a 2 x 2 x 3 matrix for a color image. We still generate a single feature map on each convolution for each kernel. We are still able to do element-wise multiplication, since the kernel size is the same as the receptive field, except now, when we do the summation, we sum across the three channels to get the single number required on each stride.

As you can imagine, there are a large number of ways we can scan an image. We can change the size and values of the kernel, change its stride, include padding, and even include noncontiguous pixels. To get a better idea of some of the possibilities, check out vdumoulin's excellent animations: https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md.
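The idea of one feature map per kernel can be demonstrated directly with F.conv2d. The two hand-made 2 x 2 kernels below (one a horizontal-edge detector, one a vertical-edge detector) are illustrative values of our own choosing, not the book's:

```python
import torch
import torch.nn.functional as F

# One single-channel 6x6 "image" containing a horizontal line of 1s.
image = torch.zeros(1, 1, 6, 6)
image[0, 0, 2, :] = 1.0

# Two kernels, so the convolution produces two output channels.
kernels = torch.tensor([[[[ 1.,  1.],
                          [-1., -1.]]],   # responds to horizontal edges
                        [[[ 1., -1.],
                          [ 1., -1.]]]])  # responds to vertical edges

maps = F.conv2d(image, kernels, stride=1)
print(maps.shape)   # (1, 2, 5, 5): two 5x5 feature maps
```

With a 6 x 6 input, a 2 x 2 kernel, stride 1, and no padding, each feature map is 5 x 5, exactly as the output-size formula predicts; and the horizontal-edge kernel fires on the line while the vertical-edge kernel stays silent.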
Multiple convolutional layers

As with fully connected linear layers, we can add multiple convolutional layers, and the same restrictions apply:

- Limitations on time and memory (computational load)
- A tendency to overfit the training set and not generalize to a test set
- A requirement for larger datasets to work effectively

The benefit of appropriately adding convolutional layers is that, progressively, they are able to extract more complex, nonlinear features from datasets.

Pooling layers

Convolutional layers are typically stacked with pooling layers. The purpose of a pooling layer is to reduce the size, but not the depth, of the feature map generated by the preceding convolution. A pooling layer retains the channel information but compresses the spatial information. The reason we do this is to enable kernels to focus selectively on certain nonlinear features. This means we can reduce the computational load by focusing on the parameters that have the strongest influence. Having fewer parameters also reduces the tendency to overfit. There are three major reasons why pooling layers are used to reduce the dimensions of the output feature map:

- They reduce the computational load by discarding irrelevant features
- With a smaller number of parameters, the model is less likely to overfit the data
- They are able to extract features that are transformed in some way, for example, images of an object from different perspectives

Pooling layers are very similar to normal convolutional layers in that they use a kernel matrix, or filter, to sample an image. The difference with pooling layers is that we downsample the input. Downsampling reduces the input dimensions. This can be achieved by increasing the size of the kernel, increasing the stride, or both. Check the formula in the section on single convolutional layers to confirm this is true. Remember, in a convolution, all we are doing is multiplying two tensors on each stride over an image.
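The size-but-not-depth behavior of pooling is easy to verify; here is a sketch using nn.MaxPool2d on a batch of 16-channel feature maps (the shapes are illustrative):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)   # stride defaults to the kernel size
x = torch.randn(100, 16, 28, 28)     # (batch, channels, height, width)
y = pool(x)
print(y.shape)                       # depth kept at 16, spatial size halved
```

Plugging kernel size 2, stride 2, and no padding into the output formula gives (28 - 2)/2 + 1 = 14, so the spatial dimensions halve while the 16 channels pass through unchanged.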
Each subsequent stride in a convolution samples another part of the input. This sampling selects the region of the previous layer's output encompassed by that particular stride, and the result is a single number. With a convolutional layer, this single number is the sum of the element-wise multiplication of the kernel with that region. With a pooling layer, this single number is typically either the average or the maximum of the values in the sampled region. The terms average pooling and max pooling refer to these two pooling techniques.

Building a single-layer CNN

So now we should have enough theory to build a simple convolutional network and understand how it works. Here is a model class template we can start with. The basic convolutional unit we will be using in PyTorch is the nn.Conv2d module. It is characterized by the following signature:

nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)

The values of these parameters are constrained by the size of the input data and the formulae discussed in the last section. In this example, in_channels is set to 1, reflecting the fact that our input image has one color channel. If we were working with a three-channel color image, this would be set to 3. out_channels is the number of kernels. We can set this to anything, but remember that there are computational penalties, and improved performance is dependent on having larger, more complex datasets. For this example, we set the number of output channels to 16. The number of output channels, or kernels, is essentially the number of low-level features we think might be indicative of the target class. We set the stride to 1 and the padding to 2. This ensures the output size remains the same as the input; this can be verified by plugging these values into the output formula in the section on single convolutional layers.
In the __init__ method, you will notice we instantiate a convolutional layer, a ReLU activation function, a MaxPool2d layer, and a fully connected linear layer. The important thing here is to understand how we derive the value we pass to the nn.Linear() function: it is the output size of the MaxPool layer. We can calculate this using our output formula. We know that the output of the convolutional layer is the same size as its input. Because the input image is square, we can use 28, the height or width, to represent the input, and consequently the output size of the convolutional layer. We also know that we have set a pooling kernel size of 2. By default, MaxPool2d sets the stride equal to the kernel size and uses no padding. For practical purposes, this means that when we use the default values for stride and padding, we can simply divide the input, here 28, by the kernel size. Since our kernel size is 2, we can calculate an output size of 14. Since we are using a fully connected linear layer, we need to flatten the width, the height, and the number of channels. We have 16 channels, as set in the out_channels parameter of nn.Conv2d. Therefore, the input size of the linear layer is 16 x 14 x 14. The output size is 10 because, as with the linear networks, we use the output to distinguish between the 10 classes.

The forward function of the model is fairly straightforward. We simply pass the out variable through the convolutional layer, the activation function, the pooling layer, and the fully connected linear layer. Notice that we need to resize the input for the linear layer. Assuming the batch size is 100, the output of the pooling layer is a four-dimensional tensor: (100, 16, 14, 14). Here, out.view(out.size(0), -1) reshapes this four-dimensional tensor to a two-dimensional tensor: (100, 16*14*14).

To make this a little more concrete, let's train our model and look at a few variables. We can use almost identical code to train the convolutional model.
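The single-layer CNN described above can be sketched as follows; the 5 x 5 kernel size is an assumption consistent with the stride of 1 and padding of 2 stated in the text, since (28 - 5 + 4)/1 + 1 = 28:

```python
import torch
import torch.nn as nn

class SingleLayerCNN(nn.Module):
    """One conv layer (16 kernels), ReLU, 2x2 max pool, linear output."""
    def __init__(self):
        super().__init__()
        # padding=2 with a 5x5 kernel keeps the 28x28 spatial size.
        self.conv = nn.Conv2d(in_channels=1, out_channels=16,
                              kernel_size=5, stride=1, padding=2)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(kernel_size=2)   # 28 -> 14
        self.fc = nn.Linear(16 * 14 * 14, 10)

    def forward(self, x):
        out = self.pool(self.relu(self.conv(x)))
        # Flatten (N, 16, 14, 14) to (N, 16*14*14) for the linear layer.
        out = out.view(out.size(0), -1)
        return self.fc(out)
```

Note that the model now takes the unflattened image tensor, so the input shape is (batch, 1, 28, 28) rather than (batch, 784).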
We do, however, need to change one line in our benchmark() function. Since convolutional layers can accept multiple input dimensions, we do not need to flatten the width and height of the input. For the previous linear models, in our running code, we used the following to flatten the input:

outputs = model(images.view(-1, 28*28))

For our convolutional layer, we do not need to do this; we can simply pass the model the images, as in the following:

outputs = model(images)

This line must also be changed in the accuracy() function we defined in the section on benchmarking earlier in this chapter.

Building a multiple-layer CNN

As you would expect, we can improve this result by adding another convolutional layer. When we are adding multiple layers, it is convenient to bundle each layer into a sequence. It is here that nn.Sequential comes in handy. We initialize two hidden layers and a fully connected linear output layer. Note the parameters passed to the Conv2d instances and the linear output layer. As before, we have one input dimension. From this, our first convolutional layer outputs 16 feature maps, or output channels. The following diagram represents the two-layer convolutional network.

This should make it clear how we calculate the output sizes, and in particular how we derive the input size for the linear output layer. We know, using the output formula, that the output size of the first convolutional layer, before max pooling, is the same as the input size, that is, 28 x 28. Since we are using 16 kernels, or channels, generating 16 feature maps, the input to the max pooling layer is a 16 x 28 x 28 tensor. The max pooling layer, with a kernel size of 2, a stride of 2, and no padding, means that we simply divide the feature map size by 2 to calculate the max pool output size. This gives us an output size of 16 x 14 x 14. This is the input size to the second convolutional layer.
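The two blocks just described can be sketched like this; as before, the 5 x 5 kernels with padding 2 are an assumption that keeps each convolution's output the same spatial size as its input:

```python
import torch
import torch.nn as nn

class TwoLayerCNN(nn.Module):
    """Two conv-ReLU-pool blocks bundled with nn.Sequential."""
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2))          # (N, 16, 28, 28) -> (N, 16, 14, 14)
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2))          # (N, 32, 14, 14) -> (N, 32, 7, 7)
        self.fc = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        out = self.layer2(self.layer1(x))
        return self.fc(out.view(out.size(0), -1))
```

Each nn.Sequential bundles a convolution, its activation, and its pooling into one callable, so the forward method stays readable as the network deepens.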
Once again, using the output formula, we can calculate that the second convolutional layer, before max pooling, generates 14 x 14 feature maps, the same size as its input. Since we set the number of kernels to 32, the input to the second max pooling layer is a 32 x 14 x 14 tensor. Our second max pooling layer is identical to the first, with the kernel size and stride set to 2 and no padding. Once again, we can simply divide by 2 to calculate the output size, and therefore the input to the linear output layer. Finally, we need to flatten this tensor to one dimension, so the input size for the linear output layer is a single dimension of 32 * 7 * 7, or 1,568. As usual, we need the output size of the final linear layer to be the number of classes, which in this case is 10.

We can inspect the model parameters to see that this is exactly what is happening when we run the code. The model parameters consist of six tensors. The first tensor holds the weights of the first convolutional layer: 16 kernels, 1 color channel, and a kernel of size 5 x 5. The next tensor is the bias and has a single dimension of size 16. The third tensor in the list holds the 32 kernels of the second convolutional layer, with 16 input channels (the depth) and a 5 x 5 kernel. In the final linear layer, the flattened 32 x 7 x 7 output gives a weight tensor of size 10 x 1568.

Batch normalization

Batch normalization is widely used to improve the performance of neural networks. It works by stabilizing the distributions of layer inputs, which is achieved by adjusting the mean and variance of these inputs. It is fairly indicative of the nature of deep learning research that there is uncertainty in the research community as to why batch normalization is so effective. It was thought that this was because it reduces so-called internal covariate shift (ICS). This refers to the change in a layer's input distribution as a result of the preceding layers' parameter updates.
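The original batch-normalized model appears as a screenshot; the following sketch shows how nn.BatchNorm2d slots into the two-layer CNN, with batch normalization applied to the output of each convolution before the activation (the placement before ReLU is a common convention and an assumption here):

```python
import torch
import torch.nn as nn

class BatchNormCNN(nn.Module):
    """The two-layer CNN with batch normalization after each convolution."""
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),      # normalizes the 16 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2))
        self.fc = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        out = self.layer2(self.layer1(x))
        return self.fc(out.view(out.size(0), -1))
```

nn.BatchNorm2d takes the number of channels as its argument and adjusts the per-channel mean and variance of its input, adding a learnable scale and shift per channel.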
The original motivation for batch normalization was to reduce this shift. However, a clear link between ICS and performance has not been conclusively found. More recent research has shown that batch normalization works by smoothing the optimization landscape. Basically, this means that gradient descent will work more efficiently. Details of this can be found in How Does Batch Normalization Help Optimization? by Santurkar et al., available at https://arxiv.org/abs/1805.11604.

Batch normalization is implemented with the nn.BatchNorm2d module. This model is identical to the previous two-layer CNN, with the addition of batch normalization of the output of the convolutional layers. The following is a printout of the performance of the three convolutional networks we have built so far.

Summary

In this chapter, we saw how we could improve the simple linear network developed in Chapter 3, Computational Graphs and Linear Models. We can add linear layers, increase the width of the network, increase the number of epochs we run the model for, and tweak the learning rate. However, linear networks are not able to capture the nonlinear features of datasets, and at some point their performance will plateau. Convolutional layers, on the other hand, use a kernel to learn nonlinear features. We saw that with two convolutional layers, performance on MNIST improved significantly. In the next chapter, we'll look at some different network architectures, including recurrent networks and long short-term memory networks.

5 Other NN Architectures

Recurrent networks are essentially feedforward networks that retain state. All the networks we have looked at so far require an input of a fixed size, such as an image, and give a fixed-size output, such as the probabilities of a particular class.
Recurrent networks are different in that they accept a sequence of arbitrary size as input and produce a sequence as output. Moreover, the internal state of the network's hidden layers is updated as a result of a learned function and the input. In this way, a recurrent network remembers its state: subsequent states are a function of previous states.

In this chapter, we will cover the following:

- Introduction to recurrent networks
- Long short-term memory networks

Introduction to recurrent networks

Recurrent networks have been shown to be very powerful in predicting time series data. This ability is fundamental to biological brains, enabling us to do things such as safely drive a car, play a musical instrument, evade predators, understand language, and interact with a dynamic world. A sense of the flow of time and an understanding of how things change over time is fundamental to intelligent life, so it is no surprise that this ability is important in artificial systems. The ability to understand time series data is also important in creative endeavors, and recurrent networks have shown some ability in tasks such as composing a melody, constructing grammatically correct sentences, and creating visually pleasing images.

Feedforward and convolutional networks achieve very good results, as we have seen, in tasks such as the classification of static images. However, working with continuous data, as is required for tasks such as speech or handwriting recognition, predicting stock market prices, or forecasting the weather, requires a different approach. In these types of tasks, both the input and the output are no longer fixed-size data, but sequences of arbitrary length.

Recurrent artificial neurons

For artificial neurons in feedforward networks, the flow of activation is simply from the input to the output.
Recurrent artificial neurons (RANs) have a connection from the output of the activation layer back to their linear input, essentially summing the output back into the input. A RAN can be unrolled in time: each subsequent state is a function of previous states. In this way, a RAN can be said to have a memory of its previous states.

In the preceding diagram, the diagram on the left illustrates a single recurrent neuron. It sums its input, x, with its output, y, to produce a new output. The diagram on the right shows this same unit unrolled over three time steps. We can write an equation for the output with respect to the input for any given time step as follows:

y(t) = Φ(wx · x(t) + wy · y(t-1) + b)

Here, y(t) is the output vector at time t, x(t) is the input at time t, y(t-1) is the output of the previous time step, b is the bias term, and Φ is the activation function, usually tanh or ReLU. Notice that each unit has two sets of weights, wx and wy, for the inputs and the outputs respectively. This is, essentially, the formula we used for our linear networks, with an added term to represent the output fed back into the input from time t-1.

In the same way that with CNNs (convolutional neural networks) we could compute the outputs of an entire layer over a batch using a vectorized form of the previous equation, this is also possible with recurrent networks. The following is the vectorized form for a recurrent layer:

Y(t) = Φ(X(t) · Wx + Y(t-1) · Wy + b)

Here, Y(t) is the output at time t. This is a matrix of size (m, n), where m is the number of instances in the batch and n is the number of units in the layer. X(t) is a matrix of size (m, i), where i is the number of input features. Wx is a matrix of size (i, n), containing the input weights of the current time step. Wy is a matrix of size (n, n), containing the weights of the outputs for the previous time step.

Implementing a recurrent network

So that we can concentrate on the models, we will use the same dataset we are already familiar with.
Even though we are working with static images, we can treat them as time series by unrolling each 28-pixel row over 28 time steps, enabling the network to make a computation on the complete image.

In the preceding model, we use the nn.RNN class to create a model with two recurrent layers. The nn.RNN class has the following default signature:

nn.RNN(input_size, hidden_size, num_layers, batch_first=False, nonlinearity='tanh')

The input is our 28 x 28 MNIST images. This model takes 28 pixels of each image, unrolling them over 28 time steps to make a computation over the entirety of all the images in the batch. The hidden_size parameter is the dimension of the hidden layers, and this is something we choose. Here, we use a size of 100. The batch_first parameter specifies the expected shape of the input and output tensors. We want these to have the shape (batch, sequence, features). In this example, the expected input/output tensor shape we want is (100, 28, 28): the batch size, the length of the sequence, and the number of features at each step. However, by default, the nn.RNN class uses input/output tensors of the form (sequence, batch, features). Setting batch_first=True ensures the input/output tensors have the shape (batch, sequence, features).

In the forward method, we initialize a tensor for the hidden state, h0, that is updated on every iteration of the model. The shape of this hidden tensor is of the form (layers, batch, hidden). In this example, we have two layers. The second dimension of the hidden state is the batch size. Remember, we are using batch first, so this is the first dimension of the input tensor, x, obtained using x.size(0). The final dimension is the hidden size, which in this example we have set to 100. The nn.RNN class requires an input consisting of the input tensor, x, and the h0 hidden state.
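The model just described can be sketched as follows; the class name and default argument values are assumptions, but the layer configuration (two recurrent layers, hidden size 100, batch_first=True) follows the text:

```python
import torch
import torch.nn as nn

class RNNModel(nn.Module):
    """Two-layer RNN treating each 28x28 image as 28 steps of 28 features."""
    def __init__(self, input_size=28, hidden_size=100, num_layers=2,
                 num_classes=10):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.rnn = nn.RNN(input_size, hidden_size, num_layers,
                          batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Hidden state of shape (layers, batch, hidden), zeroed each call.
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        out, hn = self.rnn(x, h0)
        # Keep only the last time step: (batch, hidden) -> (batch, classes).
        return self.fc(out[:, -1, :])
```

The forward method takes the image tensor reshaped to (batch, 28, 28) and pairs it with the freshly initialized hidden state, exactly as the text describes.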
This is why, in the forward method, we pass in these two variables. The forward method is called once every iteration, updating the hidden state and giving an output. Remember, the number of iterations is the number of epochs multiplied by the data size divided by the batch size. Importantly, as you can see, we need to index the output using the following:

out = self.fc(out[:, -1, :])

We are only interested in the output of the last time step, since this is the knowledge accumulated over the whole sequence for each image in the batch. If you remember, the output shape is of the form (batch, sequence, features), and in our model this is (100, 28, 100). The number of features is simply the number of hidden dimensions, or the number of units in the hidden layer. We require all batches: this is why we use a colon as the first index. The -1 indicates we only want the last element of the sequence. The last index, another colon, indicates we want all of the features. Hence, our output is all the features of the last time step in the sequence, for an entire batch.

We can use almost identical training code; however, we do need to reshape the images when we call the model. Remember that for linear models, we reshaped the input using the following:

outputs = model(images.view(-1, 28*28))

For convolutional networks, using nn.Conv2d, we could pass in the unflattened images, and for recurrent networks, when using nn.RNN, we need the input to be of the form (batch, sequence, features). Therefore, we can use the following to reshape it:

outputs = model(images.view(-1, 28, 28))

Remember, we need to change this line in both our training code and testing code.
The following printout is the result of running three recurrent models using different layer and hidden size configurations.

To get a better understanding of how this model works, consider the following diagram, representing our two-layer model with a hidden size of 100. At each of the 28 time steps, the network takes an input consisting of 28 pixels (the features) from each of the images in the 100-image batch. Each time step is basically a two-layer feedforward network. The only difference is that there is an extra input to each of the hidden layers: the output from the equivalent layer in the previous time step. At each time step, another 28 pixels are sampled from each of the 100 images in the batch. Finally, when the entirety of all the images in the batch has been processed, the weights of the model are updated and the next iteration begins. Once all iterations are complete, we read the output to obtain a prediction.

To get a better understanding of what happens when we run the code, consider the following. Here, we print out the sizes of the weight tensors for a two-layer RNN model with a hidden layer size of 100. We retrieve the weights as a list containing 10 tensors. The first tensor, of size [100, 28], holds the input weights of the first hidden layer: 100 units and the 28 features, or pixels, of the input images. This is the Wx term in the vectorized form of the recurrent network equation. The next group of parameters, of size [100, 100] and represented by the Wy term in the preceding equation, holds the hidden-to-hidden weights of the first layer: 100 units, each of size 100. The next two single-dimension tensors, each of size 100, are the biases of the input and hidden weights respectively. Next, we have the input weights, hidden weights, and biases of the second layer.
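The 10 parameter tensors described above can be listed directly; this sketch builds the recurrent layers and readout layer separately for inspection:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=28, hidden_size=100, num_layers=2, batch_first=True)
fc = nn.Linear(100, 10)   # the readout layer

# nn.RNN registers, per layer: input weights, hidden weights, two biases.
params = list(rnn.parameters()) + list(fc.parameters())
for p in params:
    print(tuple(p.shape))
```

The printout shows (100, 28) and (100, 100) weight tensors plus two biases for each layer, then the (10, 100) readout weights and their (10,) bias, matching the description in the text.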
Finally, we have the readout layer weights, a tensor of size [10, 100], for 10 possible predictions using 100 features. The final single-dimension tensor, of size [10], holds the bias units for the readout layer.

In the following code, we have replicated what is happening in the recurrent layers of our model over a single batch of images. You can see that we have simply created an iterator out of the trainLoader dataset object and assigned an images variable to a batch of images, as we did for our training code. The hidden state, h0, needs to contain two tensors, one for each layer. In each of these tensors, for each image in the batch, the activations of the 100 hidden units are stored. This explains why we need a three-dimensional tensor: the first dimension, of size 2, is for the number of layers; the second dimension, of size 100, is the batch size, obtained from images.size(0); and the third dimension, of size 100, is the number of hidden units. We then pass a reshaped image tensor and our hidden tensor to the model. This calls the model's forward() function, making the necessary computations and returning two tensors: an output tensor and an updated hidden state tensor. The following confirms these output sizes.

This should help you understand why we need to resize the images tensor. Note that the features for the input are the 28 pixels of each row of the images in the batch, which are unrolled over the sequence of 28 time steps. Next, let's pass the output of the recurrent layer to our fully connected linear layer. You can see that this will give us 10 predictions for each of the 28 steps in the sequence. This is why we need to index only the last element in the sequence. Remember, the output from nn.RNN is of size (100, 28, 100).
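The walkthrough above can be replicated with the following sketch; a random tensor stands in for the batch that the text pulls from trainLoader, and the layer objects are our own stand-ins for the model's internals:

```python
import torch
import torch.nn as nn

# Stand-in for one batch of MNIST images from trainLoader.
images = torch.randn(100, 1, 28, 28)

rnn = nn.RNN(input_size=28, hidden_size=100, num_layers=2, batch_first=True)
fc = nn.Linear(100, 10)

# (layers, batch, hidden): one state tensor per layer, per image.
h0 = torch.zeros(2, images.size(0), 100)

out, hn = rnn(images.view(-1, 28, 28), h0)
print(out.shape)    # (100, 28, 100): batch, sequence, hidden features
print(hn.shape)     # (2, 100, 100): updated hidden state

last = out[:, -1, :]        # only the last time step: (100, 100)
preds = fc(last)
print(preds.shape)          # (100, 10): ten predictions per image
```

Indexing with [:, -1, :] before the linear layer is what reduces the per-step outputs to one prediction per image.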
Note what happens to the size of this tensor when we index it using -1: it becomes a tensor containing the 100 hidden-unit outputs for each of the 100 images in the batch. This is passed to our linear layer to give the required 10 predictions for each image.

Long short-term memory networks

Long short-term memory networks (LSTMs) are a special type of RNN capable of learning long-term dependencies. While standard RNNs can remember previous states to some extent, they do this on a fairly basic level, by updating a hidden state on each time step. This enables the network to remember short-term dependencies. The hidden state, being a function of previous states, retains information about them. However, the more time steps there are between the current state and a previous state, the more the effect of that earlier state on the current state diminishes. Far less information is retained about a state that is, say, 10 time steps back than about the time step immediately preceding the current one. This is despite the fact that earlier time steps may contain important information with direct relevance to the particular problem or task we are trying to solve.

Biological brains have a remarkable ability to remember long-term dependencies, forming meaning and understanding using these dependencies. Consider how we follow the plot of a movie. We recall events that occurred at the beginning of the movie and immediately understand their relevance as the plot develops. Not only that, but we can apply context to the movie by recalling events in our own lives that give relevance and meaning to a story line. This ability to selectively apply memories to the current context, while at the same time filtering out irrelevant details, is the strategy behind the design of LSTMs.

An LSTM network is an attempt to incorporate these long-term dependencies into an artificial network.
It is considerably more complex than a standard RNN; however, it is still based on recurrent feedforward networks, and understanding that theory should enable you to understand LSTMs. The following diagram shows an LSTM over one single time step.

As with normal RNNs, each subsequent time step takes the hidden state of the previous time step, ht-1, along with the data, xt, as its input. An LSTM also passes on a cell state that is calculated on each time step. You can see that ht-1 and xt are each passed to four separate linear functions, and each pair of these linear functions is summed. Central to an LSTM are the four gates that these summations are passed into.

First, we have the Forget Gate. This uses a sigmoid activation and is element-wise multiplied by the Old Cell State. Remember, the sigmoid effectively squashes the linear output values to values between zero and one. Multiplying by zero will effectively eliminate that particular value in the cell state, and multiplying by one will keep it. The Forget Gate essentially decides what information is passed on to the next time step, and this is achieved by element-wise multiplication with the Old Cell State.

The Input Gate and the Scaled new candidate together determine what new information is retained. The Input Gate also uses a sigmoid function, and this is multiplied by the output of the New Candidate gate, creating a temporary tensor, the Scaled new candidate, c2. Note that the New Candidate gate uses tanh activation. Remember, the tanh function outputs a value between -1 and 1. Using tanh and sigmoid activations in this way, that is, by element-wise multiplication of their outputs, helps prevent the vanishing gradient problem, where outputs become saturated and their gradients repeatedly become close to zero, making the network unable to perform meaningful computations.
A New Cell State is calculated by summing the Scaled new candidate with the Scaled Old Cell State, and in this way it is able to amplify important components of the input data. The final gate, the Output Gate, O, is another sigmoid. The new cell state is passed through a tanh function, and this is element-wise multiplied by the output gate to calculate the Hidden State. This Hidden State, as with standard RNNs, is passed through a final nonlinearity, a sigmoid, and a Softmax function to give the outputs. This has the overall effect of reinforcing the high-energy components and eliminating the lower-energy components, as well as reducing the opportunities for vanishing gradients and reducing overfitting of the training set.

We can write the equations for each of the LSTM gates as follows:

Notice that these equations have an identical form to those of the RNN. The only difference is that we require eight separate weight tensors and eight bias tensors. It is these extra weight dimensions that give LSTMs their extra ability to learn and retain important features of the input data, as well as discard less important features. We can write the output of the linear output layer, for a particular time step, t, as the following:

Implementing an LSTM

The following is the LSTM model class we will use for MNIST:

Notice that nn.LSTM is passed the same arguments as the previous RNN. This is not surprising, since an LSTM is a recurrent network that works on sequences of data. Remember that the input tensor has axes of the form (batch, sequence, feature), so we set batch_first=True. We initialize a fully connected linear layer for the output layer. Notice in the forward method that, as well as initializing a hidden state tensor, h0, we also initialize a tensor to hold the cell state, c0. Remember also that the out tensor contains all 28 time steps. For our prediction, we are only interested in the last index in the sequence.
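The model class itself appears in the book only as a screenshot. The following is a sketch consistent with the description; the class name and variable names are my own, matching the MNIST setup of 28 time steps of 28 features, 100 hidden units, and 10 output classes:

```python
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, input_size=28, hidden_size=100, num_layers=1, num_classes=10):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # batch_first=True so the input has axes (batch, sequence, feature)
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Initialize both the hidden state and the cell state with zeros
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        out, _ = self.lstm(x, (h0, c0))
        # Keep only the last time step before the linear output layer
        return self.fc(out[:, -1, :])

model = LSTMModel()
images = torch.randn(100, 28, 28)  # a batch of 100 images as 28 rows of 28 pixels
print(model(images).shape)  # torch.Size([100, 10])
```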
This is why we apply the [:, -1, :] indexing to the out tensor before passing it to the linear output layer. We can print out the parameters for this model in the same way as for the RNN previously:

These are the parameters for a single-layer LSTM with 100 hidden units. There are six groups of parameters for this single-layer LSTM. Notice that instead of the input and hidden weight tensors having a size of 100 in the first dimension, as was the case for the RNN, for an LSTM this size is 400, representing 100 hidden units for each of the four LSTM gates. The first parameter tensor is for the input layer and is of size [400, 28]. The first index, 400, corresponds to the weights w1, w3, w5, and w7, each of which is of size 100, for the inputs into the 100 hidden units specified. The 28 is the number of features, or pixels, present at the input. The next tensor, of size [400, 100], contains the weights w2, w4, w6, and w8 for each of the 100 hidden units. The following two single-dimension tensors of size [400] are the two sets of bias units, b1, b3, b5, b7 and b2, b4, b6, b8, for each of the LSTM gates. Finally, we have the output weight tensor of size [10, 100]. This is our output size, 10, and the weight tensor w9. The last single-dimension tensor, of size [10], is the bias, b9.

Building a language model with a gated recurrent unit

To demonstrate the flexibility of recurrent networks, we are going to do something different in the final section of this chapter. Up until now, we have been working with probably the most used test dataset, MNIST. This dataset has well-known characteristics, and it is extremely useful for comparing different types of models and testing different architectures and parameter sets. However, there are some tasks, such as natural language processing, that quite obviously require an entirely different type of dataset.
Also, the models we have built so far have been focused on one simple task: classification. This is the most straightforward machine learning task. To give you a flavor of other machine learning tasks, and to demonstrate the potential of recurrent networks, the model we are going to build is a character-based prediction model that attempts to predict each subsequent character based on the previous character, forming a learned body of text. The model first learns to create correct vowel-consonant sequences, then words, and eventually sentences and paragraphs that mimic the form (but not the meaning) of those constructed by human authors. The following is an adaptation of code written by Sean Robertson and Pratheek that can be found here: https://github.com/spro/practical-pytorch/blob/master/char-rnn-generation/char-rnn-generation.ipynb.

Here is the model definition:

The purpose of this model is to take an input character at each time step and output the most likely next character. Over subsequent training, it begins to build up sequences of characters that mimic text from a training sample. Our input and output sizes are simply the number of characters in the input text, and this is calculated and passed in as a parameter to the model. We initialize an encoder tensor using the nn.Embedding class. In a similar way to how we used one-hot encoding to define a unique index for each word, the nn.Embedding module stores each word in a vocabulary as a multidimensional tensor. This enables us to encode semantic information in the word embedding. We need to pass the nn.Embedding module a vocabulary size (here, the number of characters in the input text) and a dimensionality in which to encode each character (here, the hidden size of the model). The word embedding model we are using is based on the nn.GRU module, or GRU.
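The model definition referenced above appears in the book as a screenshot. The following is a sketch based on the description and on the structure of the linked notebook; the class and attribute names are assumptions:

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, n_layers=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        # The embedding stores each character as a hidden_size-dimensional vector
        self.encoder = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers)
        self.decoder = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden):
        # input is a single character index; resize it for the embedding layer
        embedded = self.encoder(input.view(1, -1))
        output, hidden = self.gru(embedded.view(1, 1, -1), hidden)
        # Decode the GRU output into scores over the character set
        output = self.decoder(output.view(1, -1))
        return output, hidden

    def init_hidden(self):
        return torch.zeros(self.n_layers, 1, self.hidden_size)

model = CharRNN(input_size=100, hidden_size=50, output_size=100)
hidden = model.init_hidden()
output, hidden = model(torch.tensor([7]), hidden)  # one character in, one step
```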
This is very similar to the LSTM module we used in the previous section. The difference is that the GRU is a slightly simplified version of the LSTM. It combines the input and forget gates into a single update gate, and combines the hidden state with the cell state. The result is that the GRU is more efficient than the LSTM for many tasks. Finally, a linear output layer is initialized to decode the output from the GRU. In the forward method, we resize the input and pass it through the linear embedding layer, the GRU, and the final linear output layer, returning the hidden state and the output.

Next, we need to import the data and initialize variables containing the printable characters of our input text and the number of characters in the input text. Note the use of unidecode to convert Unicode characters to plain ASCII equivalents. You will need to import this module, and possibly install it on your system if it is not installed already. We also define two convenience functions: a function to convert a character string into the integer equivalent of each character, and another function to sample random chunks of training text. The random_training_set function returns two tensors. The inp tensor contains all characters in the chunk, excluding the last character. The target tensor contains all elements of the chunk offset by one, and so includes the last character. For example, if we were using a chunk size of 4, and this chunk consisted of the character representations [41, 27, 29, 51], then the inp tensor would be [41, 27, 29] and the target tensor [27, 29, 51]. In this way, the target tensor can be used to train the model to predict the next character:

Next, we write a method to evaluate the model. This is done by passing it one character at a time: the model outputs a multinomial probability distribution for the next most likely character.
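Before turning to evaluation, here is a sketch of the two convenience functions described above. The names char_tensor and random_training_set follow the text, while the chunk_len parameter and the use of string.printable as the character set are assumptions:

```python
import random
import string
import torch

all_characters = string.printable
n_characters = len(all_characters)  # the number of distinct characters

def char_tensor(s):
    # Convert a string into a tensor of integer character indices
    return torch.tensor([all_characters.index(c) for c in s], dtype=torch.long)

def random_training_set(text, chunk_len=200):
    # Sample a random chunk of the training text
    start = random.randint(0, len(text) - chunk_len)
    chunk = text[start:start + chunk_len]
    inp = char_tensor(chunk[:-1])    # every character except the last
    target = char_tensor(chunk[1:])  # the same characters offset by one
    return inp, target

inp, target = random_training_set("the quick brown fox jumps over the lazy dog",
                                  chunk_len=4)
```

With a chunk size of 4, inp and target each hold three indices, and each element of target is the character that follows the corresponding element of inp, as in the book's worked example.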
This is repeated to build up a sequence of characters, storing them in the predicted variable:

The evaluate function takes a temperature argument that divides the output before it is exponentiated to create a probability distribution. The temperature argument has the effect of determining the level of probability required for each prediction. For temperature values above 1, characters with lower probabilities are generated, making the resulting text more random. For temperature values below 1, higher-probability characters are generated. With temperature values close to 0, only the most likely characters will be generated. For each iteration, a character is added to the predicted string until the required length, determined by the predict_len variable, is reached, and the predicted string is returned.

The following function trains the model:

We pass it the input chunk and the target chunk. The for loop runs the model through one iteration for each character in the chunk, updating the hidden state and returning the average loss per character. Now, we are ready to instantiate and run the model. This is done with the following code:

Here, the usual variables are initialized. Notice that we are not using stochastic gradient descent for our optimizer, but rather the Adam optimizer. The term Adam stands for adaptive moment estimation. Gradient descent uses a single fixed learning rate for all learnable parameters. The Adam optimizer instead maintains a separate adaptive learning rate for each parameter. This can improve learning efficiency, particularly with sparse representations such as those used for natural language processing. Sparse representations are those where most of the values in a tensor are zero, for example in one-hot encodings or word embeddings. Once we run the model, it will print out the predicted text.
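The core of the evaluate function is the temperature-scaled multinomial draw described above. A minimal sketch of that sampling step (the function name and example values are illustrative, not the book's code):

```python
import torch

def sample_next_char(logits, temperature=0.8):
    # Divide the output by the temperature and exponentiate it to form
    # an (unnormalized) multinomial distribution over the next character
    output_dist = logits.view(-1).div(temperature).exp()
    # Draw one character index from that distribution
    return torch.multinomial(output_dist, 1)[0].item()

logits = torch.tensor([1.0, 3.0, 2.0])
# At a low temperature, the highest-scoring index (1) is almost always drawn;
# at a high temperature, any of the three indices may be drawn
low_t = sample_next_char(logits, temperature=0.05)
high_t = sample_next_char(logits, temperature=5.0)
```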
At first, the text appears as almost random sequences of characters; however, after a few cycles of training, the model learns to format the text into English-like sentences and phrases. Generative models are powerful tools, enabling us to uncover probability distributions in input data.

Summary

In this chapter, we introduced recurrent neural networks and demonstrated how to use an RNN on the MNIST dataset. RNNs are particularly useful for working with time series data, since they are essentially feedforward networks that are unrolled over time. This makes them very suitable for tasks such as handwriting and speech recognition, as they operate on sequences of data. We also looked at a more powerful variant of the RNN, the LSTM. The LSTM uses four gates to decide what information to pass on to the next time step, enabling it to uncover long-term dependencies in data. Finally, in this chapter we built a simple language model, enabling us to generate text from sample input text. We used a model based on the GRU. The GRU is a slightly simplified version of the LSTM that combines the LSTM's input and forget gates into a single update gate. This model used probability distributions to generate text from a sample input. In the final chapter, we will examine some advanced features of PyTorch, such as using PyTorch in multiprocessor and distributed environments. We will also see how to fine-tune PyTorch models and use pretrained models for flexible image classification.

6 Getting the Most out of PyTorch

By now, you should be able to build and train three different types of model: linear, convolutional, and recurrent. You should have an appreciation of the theory and mathematics behind these model architectures and be able to explain how they make predictions. Convolutional networks are probably the most studied deep learning network, especially in relation to image data.
Of course, both convolutional and recurrent networks make extensive use of linear layers, so the theory behind linear networks, most notably linear regression and gradient descent, is fundamental to all artificial neural networks. Our discussion so far has been fairly contained. We have looked at well-studied problems, such as classification using MNIST, to give you a solid understanding of the basic PyTorch building blocks. This final chapter is the launching pad for your use of PyTorch in the real world, and after reading it you should be well placed to begin your own deep learning exploration.

In this chapter, we will discuss the following topics:

Using graphics processing units (GPUs) to improve performance
Optimization strategies and techniques
Using pretrained models

Multiprocessor and distributed environments

There are a variety of multiprocessor and distributed environment possibilities. The most common reason for using more than one processor is, of course, to make models run faster. The time it takes to load MNIST, a relatively tiny dataset of 60,000 images, into memory is not significant. However, consider the situation where we have gigabytes or terabytes of data, or where the data is distributed across multiple servers. The situation is even more complex when we consider online models, where data is being harvested from multiple servers in real time. Clearly, some sort of parallel processing capability is required.

Using a GPU

The simplest way to make a model run faster is to add GPUs. A significant reduction in training time can be achieved by transferring processor-intensive tasks from the central processing unit (CPU) to one or more GPUs. PyTorch uses the torch.cuda module to interface with GPUs. CUDA is a parallel computing model created by NVIDIA that features lazy assignment, so that resources are only allocated when needed. The resulting efficiency gains are substantial.
PyTorch uses a context manager, torch.device(), to assign tensors to a particular device. The following screenshot shows an example of this:

It is more usual practice to test for a GPU and assign a device to a variable using the following semantics:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

The "cuda:0" string refers to the default GPU device. Note that we test for the presence of a GPU device and assign it to the device variable. If a GPU device is unavailable, then the device is assigned to the CPU. This allows code to run on machines that may or may not have a GPU. Consider the linear model we explored in Chapter 3, Computational Graphs and Linear Models. We can use exactly the same model definition; however, we need to change a few things in our training code to ensure that processor-intensive operations occur on the GPU. Once we have created our device variable, we can assign operations to that device. In the benchmark function we created earlier, we need to add the following line of code after we initialize the model:

model.to(device)

We also need to ensure that the operations on the images, labels, and outputs all occur on the selected device. In the for loop of the benchmark function, we make the following changes:

We need to do exactly the same thing for the images, labels, and outputs defined in our accuracy function, simply appending .to(device) to these tensor definitions. Once these changes have been made, the code should run noticeably faster if it is run on a system with a GPU. For a model with four linear layers, this code ran in just over 55 seconds, compared to over 120 seconds when run on the CPU alone on my system. Of course, CPU speed, memory, and other factors contribute to running time, so these benchmarks will be different on different systems. The exact same training code will work for a logistic regression model.
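The changes to the training loop appear in the book as screenshots. In outline, every tensor that takes part in the heavy computation is moved with .to(device); a minimal sketch, using a stand-in model rather than the book's benchmark function:

```python
import torch
import torch.nn as nn

# Fall back to the CPU when no GPU is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.to(device)  # move the model's parameters to the selected device

criterion = nn.CrossEntropyLoss()

# Inside the training loop, move each batch to the same device
images = torch.randn(64, 1, 28, 28).to(device)
labels = torch.randint(0, 10, (64,)).to(device)

outputs = model(images)
loss = criterion(outputs, labels)
```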
The same modifications also work for the training code for the other networks we have studied. Almost anything can be transferred to a GPU, but be aware that there is a computational expense incurred every time data is copied to the GPU, so do not unnecessarily transfer operations to the GPU unless a complex computation, for example calculating a gradient, is involved. If you have multiple GPUs available on your system, then nn.DataParallel can be used to transparently distribute operations across these GPUs. This can be as simple as using a wrapper around your model, for example, model = torch.nn.DataParallel(model). We can of course use a more granular approach and assign specific operations to specific devices, as shown in the following example:

with torch.cuda.device("cuda:2"):
    w3 = torch.rand(3, 3)

PyTorch has a specific memory space available to speed up the transfer of tensors to the GPU. This is used when a tensor is repeatedly allocated to a GPU, and it is enabled using the pin_memory() method, for example, w3.pin_memory(). One of the major uses for this is to speed up the loading of input data, which occurs repeatedly over a model's training cycle. To do this, simply pass the pin_memory=True argument to the DataLoader object when it is instantiated.

Distributed environments

Sometimes, data and computing resources are not available on a single physical machine. This requires protocols for exchanging tensor data over a network. With distributed environments, where computations can occur on different kinds of physical hardware over a network, there are a large number of considerations, for example, network latencies or errors, processor availability, scheduling and timing issues, and competing processing resources. In an ANN, it is essential that calculations are performed in a certain order.
The complex machinery for assigning and timing each computation across networks of machines, and across the processors in each machine, is thankfully largely hidden in PyTorch behind higher-level interfaces. PyTorch has two main packages, each of which deals with different aspects of distributed and parallel environments. This is in addition to CUDA, which we discussed previously. These packages are as follows:

torch.distributed
torch.multiprocessing

torch.distributed

Using torch.distributed is probably the most common approach. This package provides communication primitives, such as classes to check the number of nodes in a network, ensure the availability of backend communication protocols, and initialize process groups. It works on the module level. The torch.nn.parallel.DistributedDataParallel() class is a container that wraps a PyTorch model, allowing it to inherit the functionality of torch.distributed. The most common use case involves multiple processes that each operate on their own GPU, either locally or over a network. A process group is initialized on each device using the following code:

torch.distributed.init_process_group(backend='nccl', world_size=4, init_method='...')

This is run on each host. The backend specifies which communication protocol to use. The NCCL (pronounced nickel) backend is generally the fastest and most reliable. Be aware that it may need to be installed on your system. The world_size is the number of processes in the job, and the init_method is a URL pointing to the location and port at which the process is to be initialized. This can either be a network address, for example tcp://..., or a shared filesystem, for example file://.... A device can be set using torch.cuda.set_device(i). Finally, we can wrap the model using model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[i], output_device=i).
This is typically used in an initialization function that spawns each process and assigns it to a processor. This ensures that every process is coordinated through a master using the same IP address and port.

torch.multiprocessing

The torch.multiprocessing package is a replacement for the Python multiprocessing package, and is used in exactly the same way, that is, as a process-based threading interface. One of the ways it extends the Python multiprocessing package is by placing PyTorch tensors into shared memory and only sending their handles to other processes. This is achieved using a multiprocessing.Queue object. In general, multiprocessing occurs asynchronously; that is, processes for a particular device are enqueued and executed when they reach the top of the queue. Each device executes processes in the order that they are queued, and PyTorch periodically synchronizes the multiple processes when copying between devices. This means that, as far as the caller of a multiprocess function is concerned, the processes occur synchronously.

One of the major difficulties when writing multithreaded applications is avoiding deadlock, where two processes compete for a single resource. A common cause of this is when background threads lock or import a module and a subprocess is forked. The subprocess will likely be spawned in a corrupted state, causing a deadlock or another error. The multiprocessing.Queue class itself spawns multiple background threads to send, receive, and serialize objects, and these threads can also cause deadlocks. For these circumstances, the thread-free multiprocessing.queues.SimpleQueue can be used.

Optimization techniques

The torch.optim package contains a number of optimization algorithms, and each of these algorithms has several parameters that we can use to fine-tune deep learning models.
Optimization is a critical component in deep learning, so it is no surprise that different optimization techniques can be key to a model's performance. Remember, the optimizer's role is to store and update the parameter state based on the calculated gradients of the loss function.

Optimizer algorithms

There are a number of optimization algorithms besides SGD available in PyTorch. The following code shows one such algorithm:

optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)

The Adadelta algorithm is based on stochastic gradient descent; however, instead of having the same learning rate over each iteration, the learning rate adapts over time. The Adadelta algorithm maintains separate dynamic learning rates for each dimension. This can make training quicker and more efficient, as the overhead of calculating new learning rates on each iteration is quite small compared to actually calculating the gradients. The Adadelta algorithm performs well with noisy data for a range of model architectures, copes with large gradients, and works in distributed environments. It is particularly effective with large models, and works well with large initial learning rates. There are two hyperparameters associated with Adadelta that we have not discussed yet. The rho hyperparameter is used to calculate the running average of the squared gradients; this determines the decay rate. The eps hyperparameter is added to improve the numerical stability of Adadelta. The next algorithm, Adagrad, is shown in the following code:

optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0)
The Adagrad algorithm, or adaptive subgradient methods for stochastic optimization, is a technique that incorporates geometric knowledge of the training data observed in earlier iterations. This allows the algorithm to find infrequent, but highly predictive, features. The Adagrad algorithm uses an adaptive learning rate that gives frequently occurring features low learning rates and rare features higher learning rates. This has the effect of finding rare but important features of the data and calculating each gradient step accordingly. The learning rate decreases faster over each iteration for more frequent features, and more slowly for rarer features, meaning that rare features tend to maintain higher learning rates over more iterations. The Adagrad algorithm tends to work best for sparse datasets. The Adam algorithm is shown in the following code:

optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)

The Adam algorithm (adaptive moment estimation) uses an adaptive learning rate based on the mean and the uncentered variance (the first and second moments) of the gradient. Like Adagrad, it stores the average of past squared gradients. It also stores a decaying average of these gradients. It calculates the learning rate on each iteration on a per-dimension basis. The Adam algorithm combines the benefits of Adagrad, working well on sparse gradients, with the ability to work well in online and nonstationary settings. Note that Adam takes an optional tuple of beta parameters. These are the coefficients used in the calculation of the running averages of the gradients and of their squares. The amsgrad flag, when set to True, enables a variant of Adam that incorporates a long-term memory of gradients. This can assist with convergence in certain situations where the standard Adam algorithm fails to converge.

In addition to the Adam algorithm, PyTorch contains two variants of Adam. The optim.SparseAdam variant performs lazy updating of parameters, where only the moments that appear in the gradient get updated and applied to the parameters. This provides a more efficient way of working with sparse tensors, such as those used for word embeddings.
The second variant, optim.Adamax, uses the infinity norm to calculate the gradients, which, theoretically, reduces its susceptibility to noise. In practice, the choice of the best optimizer is often a matter of trial and error. The following code demonstrates the optim.RMSprop optimizer:

optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)

The RMSprop algorithm divides the learning rate for each parameter by a running average of the squared magnitudes of recent gradients for that particular parameter. This ensures that the step size on each iteration is of the same scale as the gradient. This has the effect of stabilizing gradient descent and reduces the problem of vanishing or exploding gradients. The alpha hyperparameter is a smoothing parameter that helps make the network resilient to noise. The Rprop algorithm is shown in the following code:

optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))

The Rprop algorithm (resilient backpropagation) is an adaptive algorithm that calculates weight updates by using the sign, but not the magnitude, of the partial derivative of the cost function for each weight. These are calculated for each weight independently. The Rprop algorithm takes a tuple pair of arguments, etas. These are multiplicative factors that either increase or decrease the weight update depending on the sign of the derivative calculated on the previous iteration. If the last iteration produced a derivative of the opposite sign to the current one, then the update is multiplied by the first value in the tuple, etaminus, a value less than one that defaults to 0.5. If the sign is the same on the current iteration, then the weight update is multiplied by the second value in the etas tuple, etaplus, a value greater than 1 that defaults to 1.2. In this way, the total error function is minimized.
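Whichever algorithm is chosen, the usage pattern is the same: construct the optimizer over the model's parameters, then call zero_grad(), backward(), and step() on each iteration. A sketch with a toy model (the model and data here are illustrative, not from the book):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
criterion = nn.MSELoss()
# Any of the algorithms above can be swapped in here
optimizer = optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999))

x, y = torch.randn(64, 10), torch.randn(64, 2)
losses = []
for _ in range(20):
    optimizer.zero_grad()         # clear the stored gradients
    loss = criterion(model(x), y)
    loss.backward()               # compute gradients of the loss
    optimizer.step()              # update the parameters
    losses.append(loss.item())
```

After a few iterations the recorded loss falls, regardless of which torch.optim algorithm is plugged in, because only the constructor line changes.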
Learning rate scheduler

The torch.optim.lr_scheduler module serves as a wrapper around an optimizer, scheduling the learning rate according to a specified function multiplied by the initial learning rate. A learning rate scheduler can be applied separately to each parameter group. This can speed up training time since, typically, we are able to use larger learning rates at the beginning of the training cycle and shrink this rate as the optimizer approaches minimal loss. Once a scheduler object is defined, it is typically stepped every epoch using scheduler.step(). There are a number of learning rate scheduler classes available in PyTorch, and the most common one is shown in the following code:

optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1)

This learning rate scheduler class takes a function that multiplies the initial learning rate of each parameter group; it is passed either as a single function or as a list of functions if there is more than one parameter group. The last_epoch argument is the index of the last epoch, so the default, -1, means the scheduler starts from the initial learning rate. The following screenshot shows an example of this class, assuming that we have two parameter groups:

optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1) decays the learning rate by a multiplicative factor, gamma, every step_size epochs. optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1) takes a list of milestones, measured in epochs, at which the learning rate is decayed by gamma. The milestones argument is an increasing list of epoch indices.

Parameter groups

When an optimizer is instantiated, it is passed the parameters of the model, as well as a variety of hyperparameters such as the learning rate. Optimizers are also passed other hyperparameters specific to each optimization algorithm. It can be extremely useful to set up groups of these hyperparameters that can be applied to different parts of the model.
This can be achieved by creating parameter groups, essentially a list of dictionaries that can be passed to the optimizer. The params argument must either be an iterable over torch.Tensor objects or a Python dictionary specifying default values for the optimization options. Note that the parameters themselves need to be specified as an ordered collection, such as a list, so that parameters are a consistent sequence between model runs. It is possible to specify the parameters as a parameter group. Consider the code shown in the following screenshot:

The param_groups attribute returns a list of dictionaries containing the weights and the optimizer hyperparameters. We have already discussed the learning rate. The SGD optimizer also has several other hyperparameters that can be used to fine-tune your models. The momentum hyperparameter modifies the SGD algorithm to help accelerate gradient tensors towards the optimum, usually leading to faster convergence. Momentum defaults to 0; however, using higher values, usually around 0.9, often results in faster optimization. This is especially effective on noisy data. It works by calculating a moving average across the dataset, effectively smoothing the data and consequently improving optimization. The dampening parameter can be used in conjunction with momentum as a dampening factor. The weight_decay parameter applies L2 regularization. This adds a term to the loss function that has the effect of shrinking the parameter estimates, making the model simpler and less likely to overfit. Finally, the nesterov parameter calculates the momentum based on future weight predictions. This enables the algorithm to look ahead by calculating the gradient, not with respect to the current parameters, but with respect to approximate future parameters. We can use the param_groups attribute to assign different sets of hyperparameters to each parameter group.
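The screenshots referred to in this section are not reproduced here, so the following is a sketch of the same idea: two weight tensors placed in separate parameter groups, with group-specific options, then modified by list index (the variable names w1 and w2 are assumptions based on the surrounding text):

```python
import torch
import torch.optim as optim

w1 = torch.randn(3, 3, requires_grad=True)
w2 = torch.randn(3, 3, requires_grad=True)

# Each dictionary is a parameter group with its own hyperparameters;
# options not set in a group fall back to the defaults (lr=0.01 here)
optimizer = optim.SGD([
    {'params': [w1]},
    {'params': [w2], 'lr': 0.001, 'momentum': 0.9},
], lr=0.01)

for group in optimizer.param_groups:
    print(group['lr'], group['momentum'])
# 0.01 0
# 0.001 0.9

# Change a hyperparameter of the second group by its list index
optimizer.param_groups[1]['lr'] = 0.0001
```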
Consider the code shown in the following screenshot:

Here, we have created another weight, w2, and assigned it to a parameter group. Note that in the output we have two sets of hyperparameters, one for each parameter group. This enables us to set weight-specific hyperparameters, allowing, for example, different options to be applied to each layer in a network. We can access each parameter group and change a parameter value using its list index, as shown in the code in the following screenshot:

Pretrained models

One of the major difficulties with image classification models is the lack of labeled data. It is difficult to assemble a labeled dataset of sufficient size to train a model well; it is an extremely time-consuming and laborious task. This is not such a problem for MNIST, since the images are relatively simple. They are grayscale and largely consist only of target features, there are no distracting background features, and the images are all aligned the same way and are of the same scale. A small dataset of 60,000 images is quite sufficient to train a model well. It is rare to find such a well-organized and consistent dataset in the problems we encounter in real life. Images are often of variable quality, and the target features can be obscured or distorted. They can also be of widely variable scales and rotations. The solution is to use a model architecture that is pretrained on a very large dataset. PyTorch includes six model architectures based on convolutional networks, designed for working with images on classification or regression tasks. The following list describes these models in detail:

AlexNet: This model is based on convolutional networks and achieves significant performance improvements through a strategy of parallelizing operations across processors.
The reason for this is that operations on the convolutional layers are somewhat different to those that occur on the linear layers of a convolutional network. The convolutional layers account for around 90% of the overall computation, but operate on only around 5% of the parameters. For the fully connected linear layers, the reverse is true: they account for around 5% of the computation, yet contain around 95% of the parameters. AlexNet uses a different parallelizing strategy for each to take into account the differences between linear and convolutional layers.

VGG: The basic strategy behind very deep convolutional networks (VGG) for large-scale image recognition is to increase the depth (the number of layers) while using a very small filter with a receptive field of 3 x 3 for all convolutional layers. All hidden layers include ReLU nonlinearity, and the output consists of three fully connected linear layers and a softmax layer. The VGG architecture is available in the vgg11, vgg13, vgg16, vgg19, vgg11_bn, vgg13_bn, vgg16_bn, and vgg19_bn variants.

ResNet: While very deep networks offer potentially greater computational power, they can be very difficult to optimize and train. Very deep networks often result in gradients that either vanish or explode. ResNet uses a residual network that includes shortcut skip connections to jump over some layers. These skip layers have variable weights, so that in the initial training phase the network effectively collapses into a few layers, and as training proceeds, the number of layers is expanded as new features are learned. ResNet is available in the resnet18, resnet34, resnet50, resnet101, and resnet152 variants.

SqueezeNet: SqueezeNet was designed to create smaller models with fewer parameters that are easier to export and run in distributed environments. This is achieved using three strategies.
Firstly, it reduces the receptive field of the majority of convolutions from 3 x 3 to 1 x 1. Secondly, it reduces the number of input channels into the remaining 3 x 3 filters. Thirdly, it downsamples in the final layers of the network. SqueezeNet is available in the squeezenet1_0 and squeezenet1_1 variants.

DenseNet: In densely connected convolutional networks, in contrast to standard CNNs, where activations propagate through each layer from input to output, each layer takes the feature maps of all preceding layers as inputs. This results in shorter connections between layers and a network that encourages the reuse of parameters, which in turn means fewer parameters and stronger propagation of features. DenseNet is available in the densenet121, densenet169, and densenet201 variants.

Inception: This architecture uses several strategies to improve performance, including reducing informational bottlenecks by gently reducing dimensionality between the input and output, factorizing convolutions from larger to smaller receptive fields, and balancing the width and depth of the network. The latest version is inception_v3. Importantly, Inception requires images to be of size 299 x 299, in contrast to the other models, which require images to be of size 224 x 224.

These models can be initialized with random weights by simply calling their constructor, for example model = resnet18(). To initialize a pretrained model, set the Boolean pretrained=True, for example, model = resnet18(pretrained=True). This will load the model with its weight values pre-loaded. These weights are calculated by training the network on the Imagenet dataset, which contains over 14 million images; the pretrained weights are trained on its 1,000-class (ILSVRC) subset.

Many of these model architectures come in several configurations, for example, resnet18, resnet34, vgg11, and vgg13. These variants exploit differences in layer depth, normalization strategies, and other hyperparameters.
Finding which one works best for a particular task requires some experimentation. Also, be aware that these models are designed for working with image data, and require RGB images in the form of (3, W, H). Input images need to be resized to 224 x 224, except for Inception, which requires images of size 299 x 299. Importantly, they need to be normalized in a very specific way. This can be done by creating a normalize variable and applying it through the dataset's transform, usually as part of a transforms.Compose() object. It is important that the normalize variable is given exactly the following values:

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

This ensures that the input images have the same distribution as the Imagenet set that the models were trained on.

Implementing a pretrained model

Remember the Guiseppe toys dataset we played with in Chapter 1, Introduction to PyTorch? We now finally have the tools and knowledge to be able to create a classification model for this data. We are going to do this by using a model pretrained on the Imagenet dataset. This is called transfer learning, because we are transferring the learning achieved on one dataset to make predictions on a different, usually much smaller, dataset. Using a network with pretrained weights dramatically increases its performance on much smaller datasets, and this is surprisingly easy to achieve. In the simplest case, we can pass the pretrained model a dataset of labeled images and simply change the number of output features. Remember that the pretrained models output 1,000 Imagenet class labels. For our task here, we want to categorize images into three classes: toy, notoy, and scenes. For this reason, we need to assign the number of output features to three. The code in the following screenshot is an adaption of code from the transfer learning tutorial by Sasank Chilamkurthy, found at https://chsasank.github.io.
To begin, we need to import the data. This is available from this book's website (.../toydata). Unzip this file into your working directory. You can actually use any image data you like, provided it has the same directory structure: that is, two subdirectories for training and validation sets, and within these two directories, subdirectories for each of the classes. Other datasets you might like to try are the hymenoptera dataset, containing two classes of either ants or bees, available from https://download.pytorch.org/tutorial/hymenoptera_data.zip, the CIFAR-10 dataset from torchvision/datasets, or the much larger and more challenging plant seedling dataset, containing 12 classes, available from https://www.kaggle.com/c/plant-seedlings-classification. We need to apply separate data transformations for the training and validation datasets, import and make the datasets iterable, and then assign the device to a GPU, if available, as shown in the code in the following screenshot:

Note that a dictionary is used to store two compose objects in order to transform the training and validation sets. The RandomResizedCrop and RandomHorizontalFlip transforms are used to augment the training set. For both the training and validation sets, the images are resized and center cropped, and the specific normalization values, as discussed in the last section, are applied. The data is unpacked using a dictionary comprehension. This uses the datasets.ImageFolder class, which is a generic data loader for use where the data is organized into class folders. In this case, we have three folders, NoToy, Scenes, and SingleToy, for their respective classes. This directory structure is replicated in both the val and train directories. There are 117 training images and 24 validation images, divided into the three classes.
We can retrieve the class names simply by calling the classes attribute of the ImageFolder, as shown in the code in the following screenshot:

A batch of images and their class indexes can be retrieved using the code in the following screenshot:

The inputs tensor has a size in the form of (batch, RGB, W, H). The classes tensor, of size 4, contains either a 0 (NoToy), 1 (Scenes), or 2 (SingleToy), representing the class of each of the 4 images in the batch. The class names of each image in the batch can be retrieved using the following list comprehension:

Now, let's look at the function that is used to train the model. This has a similar structure to our earlier training code, with a few additions. Training is divided into two phases, train and val. Also, the learning rate scheduler needs to be stepped for every epoch in the train phase, as shown in the code in the following screenshot:

The train_model function takes as arguments the model, the loss criterion, a learning rate scheduler, and the number of epochs. The model weights are stored by deep copying model.state_dict(). Deep copying ensures that all elements of the state dictionary are copied, and not just referenced, into the best_model_wts variable. For every epoch there are two phases, a training phase and a validation phase. In the validation phase, the model is set to evaluation mode using model.eval(). This changes the behaviour of some model layers, typically the dropout layer, setting the dropout probability to zero so that validation runs on the complete model. The accuracy and loss for both the training and validation phases are printed on each epoch. Once this is done, the best validation accuracy is printed. Before we can run the training code, we need to instantiate the model and set up the optimizer, loss criterion, and learning rate scheduler.
Here, we use the resnet18 model, as shown in the code in the following screenshot. This code works for all resnet variants, although not necessarily with the same accuracy:

The model is used with all weights, excluding the output layer, trained on the Imagenet dataset. We need only change the output layer, since the weights in all hidden layers are frozen in their pretrained state. This is done by setting the output layer to a linear layer with its output set to the number of classes we are predicting. The network up to the output layer is essentially a feature extractor for the dataset we are working with; at the output, the features we are trying to extract are the classes themselves. We can look at the structure of the model by simply running print(model). The final layer is named fc, so we can access this layer with model.fc. This is assigned a linear layer that is passed the number of input features, accessed with model.fc.in_features, and the number of output classes, here set to 3. When we run this model, we are able to achieve an accuracy of around 90%, which is actually quite impressive, considering the tiny dataset we are using. This is possible because most of the training, apart from the final layer, was done on a much larger training set.

It is possible, and a worthwhile exercise, to use the other pretrained models with a few changes to the training code. For example, the DenseNet model can be directly substituted for ResNet by simply changing the name of the output layer from fc to classifier, so instead of writing model.fc, we write model.classifier. SqueezeNet, VGG, and AlexNet have their final layers wrapped inside a sequential container, so to change the output layer, we need to go through the following four steps:
1. Find the number of filters in the output layer
2. Convert the layers in the sequential object to a list and remove the last element
3.
Add the last linear layer, specifying the number of output classes, to the end of the list
4. Convert the list back to a sequential container and add it to the model class

For the vgg11 model, the following code can be used to implement these four steps:

Summary

Now that you have an understanding of the foundations of deep learning, you should be well placed to apply this knowledge to specific learning problems that you are interested in. In this chapter, we have developed an out-of-the-box solution for image classification using pretrained models. As you have seen, this is quite simple to implement, and can be applied to almost any image classification problem you can think of. Of course, the actual performance in each situation will depend on the number and quality of images presented, as well as the precise tuning of the hyperparameters associated with each model and task.

You can generally get very good results on most image classification tasks by simply running the pretrained models with default parameters. This requires no theoretical knowledge, apart from setting up the program's running environment. You will find that when you adjust some parameters, you may improve the network's training time and/or accuracy. For example, you may have noticed that increasing the learning rate may dramatically improve a model's performance over a small number of epochs, but over subsequent epochs the accuracy actually declines. This is an example of gradient descent overshooting and failing to find the true optimum. Finding the best learning rate requires some knowledge of gradient descent. In order to get the most out of PyTorch and apply it in different problem domains, such as language processing, physical modeling, weather and climate prediction, and so on (the applications are almost endless), you need to have some understanding of the theory behind these algorithms.
This not only allows improvement on known tasks, such as image classification, but also gives you some insight into how deep learning might be applied in a situation where, for example, the input data is a time series and the task is to predict the next sequence. After reading this book, you should know the solution, which is of course to use a recurrent network. You would have noticed that the model we built to generate text—that is, to make predictions on a sequence—was quite different to the model used to make predictions on static image data. But what about the model you would have to build to help you gain insights into a particular process? This could be the electronic traffic on a website, the physical traffic on a road network, the carbon and oxygen cycle of the planet, or a human biological system. These are the frontiers of deep learning, with immense power to do good. I hope reading this short introduction has left you feeling empowered and inspired to begin exploring some of these applications. 