The first element of these tuples is the column name as a string. The second element
is the dtype for the column, which may itself be another compound dtype. Thus, you
can have subtables as part of your table. The third element of the tuple is optional; if
present, it is an integer representing the number of elements that the column should
have. If the number is not provided, a default value of 1 is assumed. Compound
dtypes are similar in nature to SQL schemas or a CSV file’s header line. Here are
some simple examples:
Code
# a simple flat dtype
fluid = np.dtype([
    ('x', int),
    ('y', np.int64),
    ('rho', 'f8'),
    ('vel', 'f8'),
    ])

Returns
dtype([('x', '<i8'), ('y', '<i8'), ('rho', '<f8'), ('vel', '<f8')])
You can create structured arrays by passing these data types into the array creation
functions as usual. Note that in some cases, such as for arange(), the dtype that you
pass in may not make sense. In such cases, the operation will fail. Functions such as
zeros(), ones(), and empty() can take all data types. For example:
Code
np.zeros(4, dtype=particles)

Returns
array([((0, 0, 0), 0.0, [0.0, 0.0, 0.0]),
       ((0, 0, 0), 0.0, [0.0, 0.0, 0.0]),
       ((0, 0, 0), 0.0, [0.0, 0.0, 0.0]),
       ((0, 0, 0), 0.0, [0.0, 0.0, 0.0])],
      dtype=[('pos', [('x', '<i8'), ('y', '<i8'), ('z', '<i8')]),
             ('mass', '<f8'), ('vel', '<f8', (3,))])
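Once a structured array exists, whole columns can be pulled out by field name and whole rows by integer index. The following is a minimal sketch using the fluid dtype defined above (the variable name cells is ours; the exact integer widths depend on your platform):

cells = np.zeros(3, dtype=fluid)
cells['rho']           # one column, as an ordinary float64 array
cells[0]               # one row, as a single record
cells['vel'] += 2.0    # columns may be modified in place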
Function       Description
minimum(a, b)  Minimum (note that this is different from np.min())
maximum(a, b)  Maximum (note that this is different from np.max())
isreal(a)      Test for zero imaginary component
iscomplex(a)   Test for nonzero imaginary component
isfinite(a)    Test for finite value
isinf(a)       Test for infinite value
isnan(a)       Test for Not a Number (NaN)
floor(a)       Next-lowest integer
ceil(a)        Next-highest integer
trunc(a)       Truncate, removing the noninteger bits
For example, we can take the sine of the linear range from zero to pi as follows:
Code
x = np.linspace(0.0, np.pi, 5)

Returns
array([0.        , 0.78539816, 1.57079633, 2.35619449, 3.14159265])

Code
np.sin(x)

Returns
array([0.00000000e+00, 7.07106781e-01, 1.00000000e+00, 7.07106781e-01,
       1.22464680e-16])
Universal functions are very significant in NumPy. One brilliant aspect of NumPy’s
design is that even though they are fundamental to many common operations, as a
user, you will almost never even notice that you are calling a universal function. They
just work.
It is common for new users of NumPy to use Python’s standard
math module instead of the corresponding universal functions. The
math module should be avoided with NumPy because it is slower
and less flexible. These deficiencies are primarily because universal
functions are built around the idea of arrays while math is built
around the Python float type.
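For instance, np.sin() happily consumes a whole array, while math.sin() only accepts a single number. A quick illustration (the exact wording of the error varies between Python versions):

import math
import numpy as np

x = np.linspace(0.0, np.pi, 5)
np.sin(x)        # evaluates the sine of every element at once
math.sin(x[1])   # fine, because x[1] is a single number
# math.sin(x)    # raises TypeError, since math.sin() expects one number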
However, not every operation can be expressed solely using universal functions. The next section covers the vital odds and ends that have yet to be detailed.
Other Valuable Functions
In addition to the suite of ufuncs, NumPy also provides some miscellaneous func‐
tions that are critical for day-to-day use. In most cases these are self-explanatory; for
instance, the sum() function sums elements in an array. Many of these allow you to
supply keyword arguments. A common keyword argument is axis, which is None by
default, indicating that these functions will operate over the entire array. However, if
axis is an integer or tuple of integers, the function will operate only over those
dimensions. Using sum() as an example:
Code
a = np.arange(9)
a.shape = (3, 3)

Returns
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

Code
np.sum(a)

Returns
36

Code
np.sum(a, axis=0)

Returns
array([ 9, 12, 15])

Code
np.sum(a, axis=1)

Returns
array([ 3, 12, 21])
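The axis argument may also be a tuple of axes, which collapses several dimensions at once. A small sketch (the array b here is purely illustrative):

b = np.ones((2, 3, 4))
np.sum(b, axis=(0, 2))   # sums over axes 0 and 2, giving array([ 8.,  8.,  8.])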
Many of these functions appear as methods on the ndarray class as well. Table 9-5
shows some of the most important global functions that NumPy provides. Please
refer to the NumPy documentation for more information.
Table 9-5. Important NumPy global functions
Function                  Description
sum(a)                    Adds together all array elements.
prod(a)                   Multiplies together all array elements.
min(a)                    Returns the smallest element in the array.
max(a)                    Returns the largest element in the array.
argmin(a)                 Returns the location (index) of the minimum element.
argmax(a)                 Returns the location (index) of the maximum element.
dot(a, b)                 Computes the dot product of two arrays.
cross(a, b)               Computes the cross product of two arrays.
einsum(subs, arrs)        Computes the Einstein summation over subscripts and a list of arrays.
mean(a)                   Computes the mean value of the array elements.
median(a)                 Computes the median value of the array elements.
average(a, weights=None)  Returns the weighted average of an array.
std(a)                    Returns the standard deviation of an array.
var(a)                    Computes the variance of an array.
unique(a)                 Returns the sorted unique elements of an array.
asarray(a, dtype)         Ensures the array is of a given dtype; if the array already has that dtype, no copy is made.
atleast_1d(a)             Ensures that the array is at least one-dimensional.
atleast_2d(a)             Ensures that the array is at least two-dimensional.
atleast_3d(a)             Ensures that the array is at least three-dimensional.
append(a, b)              Glues the values of two arrays together in a new array.
save(file, a)             Saves an array to disk.
load(file)                Loads an array from disk.
memmap(file)              Loads an array from disk lazily.
These functions can and do help, and using NumPy to its fullest often requires know‐
ing them. Some of them you will likely reach for very soon, like sum(). Others may
only rear their heads once or twice in a project, like save(). However, in all cases you
will be glad that they exist when you need them.
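As a rough sketch of how a few of these fit together (the filename digits.npy is arbitrary):

a = np.array([3, 1, 4, 1, 5, 9, 2, 6])
np.argmax(a)                 # 5, the index of the largest element
np.unique(a)                 # array([1, 2, 3, 4, 5, 6, 9])
np.save('digits.npy', a)     # write the array to disk
b = np.load('digits.npy')    # and read it back later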
NumPy Wrap-up
Congratulations! You now have a breadth of understanding about NumPy. More
importantly, you now have the basic skills required to approach any array data lan‐
guage. They all share common themes on how to think about and manipulate arrays
of data. Though the particulars of the syntax may vary between languages, the under‐
lying concepts are the same. For NumPy in particular, though, you should now be
comfortable with the following ideas:
• Arrays have an associated data type or dtype.
• Arrays are fixed-length, though their contents are mutable.
• Manipulating array attributes changes how you view the array, not the data itself.
• Slices are views and do not copy data.
• Fancy indexes are more general than slices but do copy data.
• Comparison operators return masks.
• Broadcasting stretches an array along applicable dimensions.
• Structured arrays use compound dtypes to represent tables.
• Universal and other functions are helpful for day-to-day NumPy use.
Still, NumPy arrays typically live only in memory. This means that while NumPy is well suited to performing calculations and trying out solutions to problems, it is typically not the right tool for storing data and sharing it with your friends and colleagues. For those tasks, we will need to explore the tools coming up in Chapter 10.
CHAPTER 10
Storing Data: Files and HDF5
HDF5 stands for Hierarchical Data Format 5, a free and open source binary file type
specification. HDF5 is built and supported by the HDF Group, which is an organiza‐
tion that split off from the University of Illinois at Urbana-Champaign. What makes
HDF5 great is the numerous libraries written to interact with files of this type and its
extremely rich feature set.
HDF5 has become the default binary database for scientific computing. Unlike other
software developers, scientists tend not to be primarily concerned with variable-length strings, and our data is highly structured. What sets our data apart is the sheer
quantity of it.
The Big Data regime often deals with tables that have millions to billions of rows. The
cutting edge of computational science is trying to figure out how to deal with data on the order of 10^16 to 10^18 bytes. HDF5 is at the forefront of tackling this quantity of data. At
this volume, data earns the term exascale because the size is roughly 1 exabyte. An
exabyte is almost unimaginably large. And at this scale, any improvements that can be
made to the storage size per element are worth implementing.
The beauty of HDF5 is that it works equally well on gargantuan data as it does on tiny
datasets. This allows users to play around with subsets of their data on their laptops
and then seamlessly deploy to the largest computers ever built, and everything in
between.
A contributing factor to the popularity of HDF5 is that it is accessible from almost
anywhere. The HDF Group supports interfaces in C, Fortran, Java, and C++ (mostly
deprecated; use the C interface instead). The C interface is the default and most fully
featured API. Third-party packages that interface with HDF5 are available in MAT‐
LAB, Mathematica, Haskell, and others. Python has two packages for using HDF5:
h5py and PyTables. Here, we will use PyTables and occasionally reference aspects of
the C interface.
PyTables Versus h5py
Note that we have chosen PyTables here because it adds further
querying capabilities to the HDF5 interface. These are important to
learn about, as advanced querying comes up frequently in database
theory. On the other hand, h5py exposes HDF5’s underlying paral‐
lelism features directly to the user. For general use where you may
want to ask sophisticated questions of your data, go with PyTables.
In cases where you have large amounts of data that you don’t need
to question too deeply, go with h5py. For an excellent book on
h5py, please see Andrew Collette’s Python and HDF5 (O’Reilly).
Before we can shoot for storing astronomical datasets, we first have to learn how
normal-sized files are handled. Python has a lot of great tools for handling files of
various types, since they are how most data—not just physics data—is stored. Having
an understanding of how Python handles most files is needed to fully intuit how
large-scale databases like HDF5 work. Thus, this chapter starts out with an overview
section on normal files before proceeding on to HDF5 and all of its fancy features.
Files in Python
So far in this book, the discussion has revolved around in-memory operations. How‐
ever, a real computer typically has a hard drive for long-term persistent storage. The
operating system abstracts this into a collection of files. There are many situations in
which you may need to interact with a file on your hard drive:
• Your collaborator emails you raw data. You download the attachment and want
to look at the results.
• You want to email your collaborators some of your data, quickly.
• You need to use external code that takes an input or data file. You may need to
run the program thousands of times, so you automate the generation of input
files from data that you have in-memory in Python.
• An external program that you use writes out one (or more) result files, and you
want to read them and perform further analysis.
• You want to keep an intermediate calculation around for debugging or validation.
Reading and writing files is about interacting with the outside world. The senders and
receivers in these interactions can be humans, other programs, or both. Files provide
a common object that enables these interactions. That said, files are further specialized into formats, such as .csv, .doc, .json, .mp3, .png, and so on. These formats denote
the internal structure of the file. The .txt extension is an exception in that it is not a
format; it is traditionally used to flag that a file has no specific internal structure but
contains plain, free-flowing text. We do not have the time or space in this book to
fully describe even a fraction of the popular file formats. However, Python will be able
to open all of them (if not necessarily make sense of their internal structure).
In Python, to save or load data you go through a special file handle object. The built-in open() function will return a file object for you. This takes as its argument the path
to the file as a string. Suppose you have a file called data.txt in the current directory.
You could get a handle, f, to this file in Python with the following:
f = open('data.txt')
The open() call implicitly performs the following actions:
1. Makes sure that data.txt exists.
2. Creates a new handle to this file.
3. Sets the cursor position (pos) to the start of the file, pos = 0.
The call to open() does not read into memory any part of the file, write anything out
to the file, or close the file. All of these actions must be done separately and explicitly
and are accomplished through the use of file handle methods.
Methods are functions that are defined on a class and bound to an
object, as seen in Chapter 6.
Table 10-1 lists the most important file methods.
Table 10-1. Important file handle methods
Method         Description
f.read(n=-1)   Reads in n bytes from the file. If n is not present or is -1, the entire rest of the file is read.
f.readline()   Reads in the next full line from the file and returns a string that includes the newline character at the end.
f.readlines()  Reads in the remainder of the file and returns a list of strings that end in newlines.
f.seek(pos)    Moves the file cursor to the specified position.
f.tell()       Returns the current position in the file.
f.write(s)     Inserts the string s at the current position in the file.
f.flush()      Performs all pending write operations, making sure that they are really on disk.
f.close()      Closes the file. No more reading or writing can occur.
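As a small illustration of the cursor-oriented methods, the following reads a few bytes, checks the position, and then rewinds (this assumes the data.txt file from earlier exists):

f = open('data.txt')
head = f.read(5)      # read the first five bytes
f.tell()              # 5, the current cursor position
f.seek(0)             # move the cursor back to the start of the file
line = f.readline()   # read up to and including the first newline
f.close()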
Suppose that matrix.txt represents a 4×4 matrix of integers. Each line in the file repre‐
sents a row in the matrix. Column values are separated by commas. Ideally, we would
be able to read this into Python as a list of lists of integers, since that is the most
Pythonic way to represent a matrix of integers. This is not the most efficient repre‐
sentation for a matrix—a NumPy array would be better—but it is fairly common to
read in data in a native Python format before continuing to another data structure.
The following snippet of code shows how to read in this matrix. To follow along, first
make sure that you create a matrix.txt file on your computer. You can do this by
copying the contents shown here into your favorite text editor and saving the file with
the right name:
matrix.txt
1,4,15,9
0,11,7,3
2,8,12,13
14,5,10,6

Code
f = open('matrix.txt')
matrix = []
for line in f.readlines():
    row = [int(x) for x in line.split(',')]
    matrix.append(row)
f.close()

Python matrix
[[1, 4, 15, 9],
 [0, 11, 7, 3],
 [2, 8, 12, 13],
 [14, 5, 10, 6]]
Notice that the lines from a file are always strings. This means that you have to con‐
vert the string versions of the values in matrix into integers yourself. Python doesn’t
know that you mean for the content of the file to be a matrix. In fact, the only
assumption about data types that Python can make is that a file contains strings. You
have to tell it how to interpret the contents of a file. Thus, any numbers in the file
must be converted from string forms to integers or floats. For a reminder on how to
convert between variable types, see “Variables” on page 42. Special file readers for
particular formats, which we won’t see here, may perform these conversions for you
automatically. However, under the covers, everything is still a string originally. At the
end of the preceding code snippet, also note that the file must be closed manually.
Even when you have reached the end of a file, Python does not assume that you are
done reading from it.
Files Should Always Be Closed!
A file that remains open unnecessarily can lead to accidental data
loss as well as being a security hazard. It is always better to close a
file prematurely and open a new file handle than it is to leave one
open and lingering. File handles are cheap to create, so perfor‐
mance should not be a concern.
Files are opened in one of multiple modes. The mode a file was opened with deter‐
mines the methods that can be used on the handle. Invalid methods are still present
on the handle, but trying to use them will raise an exception.
So far, we’ve only opened files in the default read-only mode. To change this, mode
flags may be passed into the open() call after the filename. The mode is specified as a
string of one or more characters with the special meanings listed in Table 10-2. A
common example is to open the file for writing and erase the existing contents. This
uses the 'w' flag:
f = open('data.txt', 'w')
Table 10-2. Useful file modes
Mode  Meaning
'r'   Read-only. No writing possible. Starting pos = 0.
'w'   Write. If the file does not exist, it is created; if the file does exist, the current contents are deleted (be careful!). Starting pos = 0.
'a'   Append. Opens the file for writing but does not delete the current contents; creates the file if it does not exist. Starting pos is at the end of the file.
'+'   Update. Opens the file for both reading and writing; may be combined with other flags; does not delete the current contents. Starting pos = 0.
As a more sophisticated example, the following adds a row of zeros to the top of our
matrix and a row of ones to the end:
Old matrix.txt
1,4,15,9
0,11,7,3
2,8,12,13
14,5,10,6

Code
f = open('matrix.txt', 'r+')
orig = f.read()
f.seek(0)
f.write('0,0,0,0\n')
f.write(orig)
f.write('\n1,1,1,1')
f.close()

New matrix.txt
0,0,0,0
1,4,15,9
0,11,7,3
2,8,12,13
14,5,10,6
1,1,1,1

Line by line, the code does the following:

1. Open the file in read and write mode, without overwriting the contents.
2. Read the entire file into a single string.
3. Go back to the start of the file.
4. Write a new line, clobbering what was there.
5. Write the original contents back to the file after the line that was just added.
6. Write another new line after the original contents.
7. Close the file now that we are done with it.
There are many times when no matter what happens in a block of code—success or
failure, completion or exception—special safety code must be run at the end of the
block. This is to prevent data loss, corruption, or even ending up in the wrong place
on the filesystem. In Python, administering these potentially hazardous situations is
known as context management. There are many context managers that perform
defensive startup actions when the code block is entered (right before the first state‐
ment), and other cleanup actions when the block is exited. Code blocks may be exited
either right after the last statement or following an uncaught exception. File handles
are the most common context managers in Python. As we have mentioned before,
files should always be closed. However, files can act as their own context managers.
When using a file this way, the programmer does not need to remember to manually
close the file; the call to the file’s close() method happens automatically.
The with statement is how a context is entered and exited. The syntax for this state‐
ment introduces the with Python keyword and reuses the as keyword. The with
statement has the following format:
with <context-manager> as <var>:
    <with-block>
Here, the <context-manager> is the actual context object, <var> is a local variable name that the context manager is assigned to, and the <with-block> is the code that is executed while the manager is open. The "as <var>" portion of this syntax is optional.
The matrix.txt file example from before can be expressed using a with statement as
follows:
matrix.txt
0,0,0,0
1,4,15,9
0,11,7,3
2,8,12,13
14,5,10,6
1,1,1,1

Code
matrix = []
with open('matrix.txt') as f:
    for line in f.readlines():
        row = [int(x) for x in line.split(',')]
        matrix.append(row)

matrix
[[0, 0, 0, 0],
 [1, 4, 15, 9],
 [0, 11, 7, 3],
 [2, 8, 12, 13],
 [14, 5, 10, 6],
 [1, 1, 1, 1]]

The file f is open directly following the colon (:). f is closed by the context manager once the indentation level returns to that of the with keyword.
Using with statements is the recommended way to use files, because not having to
explicitly call f.close() all of the time makes your code much safer and more robust.
Other kinds of context managers exist. You can write your own. However, files are the
context managers that you will most frequently encounter. This is because it is easy to
forget to close files, and the consequences can be relatively severe if you do.
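For the curious, here is a minimal sketch of a homemade context manager built with the standard library's contextlib module. The timer() name and the printed message are our own invention, not a standard tool:

import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    start = time.time()
    try:
        yield                  # the body of the with block runs here
    finally:                   # runs even if the block raises an exception
        print("{0}: {1} seconds".format(label, time.time() - start))

with timer("reading matrix.txt"):
    with open('matrix.txt') as f:
        text = f.read()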
Now that we have seen the basics of how to read and write normal Python files, we
can start to discover HDF5. HDF5 is one of the richest and most useful file formats
for scientific computing. HDF5 puts numeric data first. This helps distinguish it from
other file formats where strings rule, like we have seen in this section. Before getting
into the nitty-gritty, it will be helpful to first gain some perspective.
An Aside About Computer Architecture
Computers are physical tools, just like any other experimental device. So far, we have
been able to ignore how they work and their internal structure. When it comes to
data storage, though, there are enough subsystems simultaneously dancing that we
need to understand what a computer is in order to effectively program it. As you will
see later in this chapter, this knowledge can make the difference between your physics
software taking a week to run or five minutes.
Overlooking other peripheral devices (keyboard, mouse, monitor), a basic computer
has consisted of three main components since the 1980s: a central processing unit
(CPU), random-access memory (RAM), and a storage drive. Historically, the storage
drive has gone by the name hard disk drive (HDD), because the device was made up
of concentric spinning magnetic disks. More recent storage devices are built like a
flash memory stick; these drives are called solid state drives (SSDs). The CPU, RAM,
and storage can be thought of as living in series with each other, as seen in
Figure 10-1.
Figure 10-1. A simple model of a computer
In this simple model, the CPU can be thought of as a dumb calculator; RAM is what
“remembers” what the CPU just did (it acts sort of like short-term memory), and the
storage is what allows the computer to save data even when it is turned off (it’s like
long-term memory). In practice, when we talk about the computer “doing some‐
thing,” we really mean the CPU. When we talk about the filesystem, we really mean
the storage. It is important to understand, therefore, that RAM is what shuffles bytes
between these two components.
Of course, computer architectures have become much more complicated than this
simple model. CPU caches are one major mainstream advancement. These caches are
like small versions of RAM that live on the CPU. They contain copies of some of the
data in RAM, but much closer to the processor. This prevents the computer from
having to go out to main memory all of the time. For commonly accessed data, the
caches can provide huge decreases in execution time. The caches are named after a
hierarchy of level numbers, such as L1, L2, and so on. In general, the lower the level number, the smaller the cache, but the faster it is to access. Currently, most processors come with L1 and L2 caches. Some processors are now also starting to
come with L3 caches. Figure 10-2 represents a computer with CPU caches.
Figure 10-2. A computer with L1, L2, and L3 CPU caches
The other big innovation in computer architecture is graphics processing units, or
GPUs. These are colloquially known as graphics cards. They are processors that live
outside of the main CPU. A computer with a GPU is displayed in Figure 10-3.
Figure 10-3. A computer with a GPU
Though there are many important differences between GPUs and CPUs, very
roughly, you can think of GPUs as being really good at floating-point operations.
CPUs, on the other hand, are much better at integer operations than GPUs (while still
being pretty good with floating-point data). So, if you have an application that is pri‐
marily made up of floats, then GPUs may be a good mechanism to speed up your exe‐
cution time.
Naturally, there is a lot more to computer engineering and architecture than what you
have just seen. However, this gives you a good mental model of the internal structure
of a computer. Keep this in mind as we proceed to talking about databases and HDF5.
Many real-world programming trade-offs are made and balanced because of the
physical performance of the underlying machine.
Big Ideas in HDF5
Persisting structured, numerical data to binary formats is superior to using plain-text
ASCII files. This is because, by their nature, they are often smaller. Consider the fol‐
lowing comparison between integers and floats in native and string representations:
# small ints
42               (4 bytes)
'42'             (2 bytes)

# medium ints
123456           (4 bytes)
'123456'         (6 bytes)

# near-int floats
12.34            (8 bytes)
'12.34'          (5 bytes)

# e-notation floats
42.424242E+42    (8 bytes)
'42.424242E+42'  (13 bytes)
In most cases, the native representation is smaller than the string version. Only by
happenstance are small integers and near-integer floats smaller in their string forms.
Such cases are relatively rare on average, so native formats almost always outperform
the equivalent strings in terms of space.
Space is not the only concern for files. Speed also matters. Binary formats are always faster for I/O because, in order to do real math with numbers stored in string form, you first have to convert them to the native format. The Python
conversion functions int() and float() are known to be relatively slow because the
C conversion functions atoi() and atof() that they wrap around are expensive
themselves.
Still, it is often desirable to have something more than a binary chunk of data in a file.
HDF5 provides common database features such as the ability to store many datasets,
user-defined metadata, optimized I/O, and the ability to query its contents. Unlike
SQL, where every dataset lives in a single namespace, HDF5 allows datasets to live in
a nested tree structure. In effect, HDF5 is a filesystem within a file—this is where the
“hierarchical” in the name comes from.
PyTables provides the following basic dataset classes that serve as entry points for
various HDF5 constructs:
Array
    The files of the filesystem
CArray
    Chunked arrays
EArray
    Extendable arrays
VLArray
    Variable-length arrays
Table
    Structured arrays
All of these must be composed of what are called atomic types in PyTables. The
atomic types are roughly equivalent to the primitive NumPy types that were seen in
“dtypes” on page 204. There are six kinds of atomic types supported by PyTables.
Here are their names, descriptions, and supported sizes:
bool
    True or false type—8 bits
int
    Signed integer types—8, 16, 32 (default), and 64 bits
uint
    Unsigned integer types—8, 16, 32 (default), and 64 bits
float
    Floating-point types—16, 32, and 64 (default) bits
complex
    Complex floating-point types—64 and 128 (default) bits
string
    Fixed-length raw string type—8 bits times the length of the string
Other elements of the hierarchy may include:
Groups
    The directories of the filesystem; may contain other groups and datasets
Links
    Like soft links on the filesystem
Hidden nodes
    Like hidden files
These pieces together are the building blocks that you can use to richly describe and
store your data. HDF5 has a lot of features and supports a wide variety of use cases.
That said, simple operations are easy to implement. Let’s start with basic file reading
and writing.
File Manipulations
HDF5 files may be opened from Python via the PyTables interface. To get PyTables,
first import tables. Like with numpy and np, it is common to abbreviate the tables
import name to tb:
import tables as tb
f = tb.open_file('/path/to/file', 'a')
Files have modes that they may be opened in, similarly to how plain-text files are
opened in Python. Table 10-3 displays the modes that are supported by PyTables.
Table 10-3. HDF5 file modes
Mode  Description
r     Read-only—no data can be modified.
w     Write—a new file is created; if a file with that name exists, it is deleted.
a     Append—an existing file is opened for reading and writing, or if the file does not exist, it is created.
r+    Similar to a, but the file must already exist.
In HDF5, all nodes stem from a root node, "/" or f.root. In PyTables, you may
access subnodes as attributes on nodes higher up in the hierarchy—e.g.,
f.root.a_group.some_data. This sort of access only works when all relevant nodes
in the tree have names that are also valid Python variable names, however; this is
known as natural naming.
Creating new nodes must be done on the file handle, not the nodes themselves. If we
want to make a new group, we have to use the create_group() method on the file.
This group may then be accessed via the location it was created in. For example, cre‐
ating and accessing a group called a_group on the root node can be done as follows:
f.create_group('/', 'a_group', "My Group")
f.root.a_group
Possibly more important than groups, the meat of HDF5 comes from datasets. The
two most common datasets are arrays and tables. These each have a corresponding
create method that lives on the file handle, called create_array() and
create_table(). Arrays are of fixed size, so you must create them with data. The
type of the data in the HDF5 file will be interpreted via numpy. Tables, like NumPy
structured arrays, have a set data type. Unlike arrays, tables are variable length, so we
may append to them after they have been created. The following snippet shows how
to create an array and a table and how to populate them using Python lists and
NumPy arrays:
# integer array
f.create_array('/a_group', 'arthur_count', [1, 2, 5, 3])
# tables need descriptions
dt = np.dtype([('id', int), ('name', 'S10')])
knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)
f.create_table('/', 'knights', dt)
f.root.knights.append(knights)
At this point, the hierarchy of groups and datasets in the file is represented by the
following:
/
|-- a_group/
|   |-- arthur_count
|-- knights
Arrays and tables attempt to preserve the original flavor, or data structure, with
which they were created. If a dataset was created with a Python list, then reading out
the data will return a Python list. If a NumPy structured array was used to make the
data, then a NumPy structured array will be returned. Note that you can read data
from a dataset simply by slicing (described in Chapter 9). One great thing about
PyTables and HDF5 is that only the sliced elements will be read in from disk. Parts of
the dataset that are not included in the slice will not be touched. This speeds up read‐
ing by not making the computer do more work than it has to, and also allows you to
read in portions of a dataset whose whole is much larger than the available memory.
Using our sample arthur_count array, the following demonstrates flavor preservation.
Also note that the type of the dataset comes from PyTables, and this is separate from
the type of the data that is read in:
Code
f.root.a_group.arthur_count[:]

Returns
[1, 2, 5, 3]

Code
type(f.root.a_group.arthur_count[:])

Returns
list

Code
type(f.root.a_group.arthur_count)

Returns
tables.array.Array
Since the arthur_count array came from a Python list, only Python list slicing is avail‐
able. However, if a dataset came from a NumPy array originally, then it can be
accessed in a NumPy-like fashion. This includes slicing, fancy indexing, and masking.
The following demonstrates this NumPy-like interface on our knights table:
Code
f.root.knights[1]

Returns
(12, 'Bedivere')

Code
f.root.knights[:1]

Returns
array([(42, 'Lancelot')],
      dtype=[('id', '<i8'), ('name', 'S10')])
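The querying capabilities that motivated our choice of PyTables are available on tables through the where() method, which takes a condition string and yields only the matching rows. A minimal sketch against the knights table (the condition here is just an example; on Python 3 the names come back as byte strings):

[row['name'] for row in f.root.knights.where('id < 28')]
# -> ['Bedivere']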
# conceptually, looking up the value stored under the key 'proton'
h = hash('proton')        # hash the key
i = h % len(table)        # map the hash onto a row index in the table
value = table[i].value    # read the value stored in that row
The main innovation of hash tables for associating data is that they prevent testing
equality over all of the keys to find the corresponding value. In fact, even though the
lookup mechanics are more complex, they perform on average much faster than if
you were to search through an equivalent data structure, like a list of tuples. In the
worst case, they still perform just as well as these alternatives.
For getting, setting, or deleting an item, hash tables are on average order-1. This
means that whether you have an array with 10 elements or 50 billion elements,
retrieving an item will take the same amount of time. In the worst case, getting, set‐
ting, and deleting items can take up to order-N, or O(N), where N is the size of the
table. This incredibly unlikely situation is equivalent to looping over the whole table,
which is what you would have to do in the worst case when using a list of tuples.
Big O Notation
Big O notation is a shorthand for describing the limiting behavior
of an algorithm with respect to the size of the data. This looks like a
function O() with one argument. It is meant to be read as “order
of ” the argument. For example, O(N) means “order-N” and O(n
log(n)) is “order n log(n).” It is useful for quickly describing the
relative speed of an algorithm without having to think about the
specifics of the algorithm’s implementation.
However, there are two big problems with hash tables as formulated previously. What
happens when the table runs out of space and you want to add more items? And what
happens when two different keys produce the same hash?
Resizing
When a hash table runs out of space to store new items, it must be resized. This typi‐
cally involves allocating new space in memory, copying all of the data over, and reor‐
ganizing where each item lives in the array. This reorganization is needed because a
hash table takes the modulus by the table size to determine the row index.
From Table 11-1, take the neutrino key, whose hash is -4886380829903577079. This
hash value is the same no matter the table length. However, when you mod this value
by 8, it produces an index of 1. If the table size were doubled, then the neutrino hash
mod 16 would produce an index of 9. In general, each index will change as the result
of a resize. Thus, a resize does more than just copy the data: it also rearranges it. For
example, consider resizing Table 11-1 to length 12. This expanded table can be seen
in Table 11-2.
Table 11-2. Longer sample hash table
i    Key         hash(key)             hash(key)%12  Value
0
1
2    'electron'  4017007007162656526   2             0.000548579909
3    'neutron'   3690763999294691079   3             1.008664
4
5    'neutrino'  -4886380829903577079  5             3.31656614e-9
6
7
8
9    'proton'    -4127328116439630603  9             1.00727647
10
11
The two hash tables seen in Table 11-1 and Table 11-2 contain the same information
and are accessed in the same way. However, they have radically different layouts,
solely based on their size.
The size of the table and the layout are handled automatically by the hash table itself.
Users of a hash table should only have to worry about resizing insofar as to under‐
stand that multiple insertion operations that each emplace one item will almost
always be more expensive than a single insertion that emplaces multiple items. This is
because multiple insertions will force the hash table to go through all of the inter‐
mediate sizes. On insertion of multiple items, the hash table is allowed to jump
directly to the final size.
For Python dictionaries, automatic resizing means that you should
try to update() dicts when possible rather than assigning new
entries (d[key] = value) over and over again. Given two dictionar‐
ies x and y, it is much better to use x.update(y) than to write:
for key, value in y.items():
    x[key] = value
Different hash table implementations choose different strategies for deciding when to
resize (table is half full, three-quarters full, completely full, never) and by how much
to resize (double the size, half the size, not at all). Resizing answers the question of
what to do when a hash table runs out of space. However, we still need to address
what to do when two keys accidentally produce the same index.
Collisions
A hash collision occurs when a new key hashes to the same value as an existing key in
the table. Even though the space of hash values is huge, ranging from
-9223372036854775808 to 9223372036854775807 on 64-bit systems, hash collisions
are much more common than you might think. For simple data, it is easy to show
that an empty string, the integer 0, the float 0.0, and False all hash to the same value
(namely, zero):
hash('') == hash(0) == hash(0.0) == hash(False) == 0
However, even for random keys, hash collisions are an ever-present problem. Such
collisions are an expression of the birthday paradox. Briefly stated, the likelihood that
any two people in a room will have the same birthday is much higher than that of two
people sharing a specific birthday (say, October 10th). In terms of hash tables, this
can be restated as “the likelihood that any pair of keys will share a hash is much
higher than the probability that a key will have a given hash (say, 42) and that another
key will have that same hash (again, 42).”
Set the variable s as the size of the table and N as the number of distinct hash values
possible for a given hash function. An approximate expression for the likelihood of a
hash collision is given by p_c(s):

    p_c(s) = 1 - e^(-s(s - 1) / (2N))

This may be further approximated as:

    p_c(s) = s^2 / (2N)
For Python dicts, N=2**64, and Figure 11-1 shows this curve. After about a billion
items, the probability of a hash collision in a Python dict starts going up dramati‐
cally. For greater than 10 billion items, a collision is effectively guaranteed. This is
surprising, since the total range of the space is about 1.844e19 items. This means that
a collision is likely to happen even though only one-billionth of the space is filled. It is
important to note that the shape of this curve is the same for all values of N, though
the location of the inflection point will change.
Figure 11-1. Hash collision probability for Python dictionaries
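The curve in Figure 11-1 can be reproduced directly from the approximate expression above. A minimal sketch (the sample sizes are arbitrary):

import math

def collision_prob(s, N=2**64):
    """Approximate chance that any two of s keys share a hash value."""
    return 1.0 - math.exp(-s * (s - 1) / (2.0 * N))

for s in (10**6, 10**8, 10**9, 10**10):
    print(s, collision_prob(s))   # climbs from roughly 0 to 0.03 to 0.93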
Given that hash collisions will happen in any real application, hash table implementa‐
tions diverge in the way that they choose to handle such collisions. Some of the
strategies follow the broad strokes presented here:
• Every index is a bucket, or list of key/value pairs. The hash will take you to the
right bucket, and then a linear search through every item in the bucket will find
the right value. This minimizes the number of resizes at the expense of a linear
search and a more complex data structure. This is known as separate chaining.
• In the event of a collision, the hash is modified in a predictable and invertible
way. Continued collisions will cause repeated modifications and hops. Searching
for a key will first try the hash and the successive modifications. This is known as
open addressing and is the strategy that Python dictionaries implement. This has
the benefit that all items live flat in the table. There are no sublists to search
through. However, this comes at the cost that the index of a key is not computa‐
ble from the hash of the key alone. Rather, the index depends on the full history
of the hash table—when items were inserted and deleted.
• Always resize to the point of zero collisions. This works well where all of the keys
must share the same type. This is sometimes called a bidirectional or bijective
hash map because the keys are uniquely determined by their hashes, just like the
hashes are uniquely determined by their keys.
As a user of hash tables, the details of how the collisions are handled are less impor‐
tant than the fact that collisions happen and they affect performance.
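To make the first strategy concrete, here is a toy separate-chaining table. This is only an illustrative sketch, not how Python's dict actually resolves collisions:

class ChainedTable(object):
    """A tiny hash table that stores colliding keys in per-bucket lists."""

    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def __setitem__(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))        # otherwise add a new pair

    def __getitem__(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:                # linear search within one bucket
            if k == key:
                return v
        raise KeyError(key)

masses = ChainedTable()
masses['proton'] = 1.00727647
masses['electron'] = 0.000548579909
masses['proton']    # -> 1.00727647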
In summary, hash tables are beautiful and ubiquitous. On average they have incredi‐
ble performance properties, being order-1 for all operations. This comes at the
expense of significantly increased implementation complexity over more fundamen‐
tal containers, such as arrays. Luckily, since almost every modern programming lan‐
guage supports hash tables natively, only in the rarest circumstances will you ever
need to worry about writing one yourself.
Up next, we will talk about another data structure that you should never have to
implement, but that is crucial to have in your analysis toolkit.
Data Frames
A relative newcomer, the data frame is a must-have data structure for data analysis. It
is particularly useful to experimentalists. The data frame is an abstraction of tables
and structured arrays. Given a collection of 1D arrays, the data frame allows you to
form complex relationships between those arrays. Such relationships extend beyond
being columns in a table, though that is possible too. One of the defining features of
data frames that makes them invaluable for experimentalists is that they gracefully
handle missing data. Anyone who has worked with real data before knows that this is
a feature that should not be overlooked.
The data frame was popularized by the R language, which has a native data frame
implementation and is largely (though not solely) used for statistics. More recently,
the pandas package for Python implements a data frame and associated tools.
Data frames are effectively tables made up of named columns called series. Unlike in
other table data structures we have seen, the series that make up a data frame may be
dynamically added to or removed from the frame. A series can be thought of as a 1D
NumPy array (it has a dtype) of values along with a corresponding index array. The
index specifies how values of the series are located. If no index is provided, regular
integers spanning from zero to one minus the length of the series are used. In this
case, series are very similar to plain old NumPy arrays. The value of indexes, as we
will see, is that they enable us to refer to data by more meaningful labels than zero
through N-1. For example, if our values represented particle counts in a detector, the
index could be made up of the strings 'proton', 'electron', and so on. The data
frames themselves may also have one or more index arrays. If no index is provided,
then zero to N-1 is assumed. In short, data frames are advanced in-memory tables
that allow human-readable data access and manipulation through custom indexes.
The usage of data frames was first presented in Chapter 8. Here, we will cover their
basic mechanics. For the following examples, you’ll need to import the pandas pack‐
age. Note that like numpy and pytables, the pandas package is almost always impor‐
ted with an abbreviated alias, namely pd:
import pandas as pd
Let’s start by diving into series.
Series
The Series class in pandas is effectively a one-dimensional NumPy array with an
optional associated index. While this is not strictly accurate, much of the NumPy fla‐
vor has been transferred to Series objects. A series may be created using array-like
mechanisms, and they share the same primitive dtype system that NumPy arrays use.
The following example creates a series of 64-bit floats:
Code
pd.Series([42, 43, 44], dtype='f8')

Returns
0    42
1    43
2    44
dtype: float64

Index and values columns
dtype of the values
Note that the column on the left is the index, while the column on the right displays
the values. Alternatively, we could have passed in our own custom noninteger index.
The following shows a series s with various particle names used as the index and the
values representing the number of the associated particle that a detector has seen:
Code
s = pd.Series([42, 43, 44],
              index=["electron",
                     "proton",
                     "neutron"])

Returns
electron    42
proton      43
neutron     44
dtype: int64
The index itself is very important, because this dictates how the elements of the series
are accessed. The index is an immutable ndarray that is composed of only hashable
data types. As with the keys in a dictionary, the hashability ensures that the elements
of an index may be used to safely retrieve values from a series. In the following code
snippet, we see that we can index into a series in a dict-like fashion to pull out a sin‐
gle value. We can also slice by indices to create a subseries, because the index itself has
an order. Finally, even though the series has an index, we can always go back to
indexing with integers:
Code
s['electron']

Returns
42

Code
# inclusive bounds
s['electron':'proton']

Returns
electron    42
proton      43
dtype: int64

Code
# integer indexing still OK
s[1:]

Returns
proton     43
neutron    44
dtype: int64
Series may also be created from dictionaries. In this case, the keys become the index
and the elements are sorted according to the keys. The following code demonstrates a
series t being created from a dict with string particle keys and associated integer
values:
Code
t = pd.Series({'electron': 6,
               'neutron': 28,
               'proton': 496,
               'neutrino': 8128})

Returns
electron       6
neutrino    8128
neutron       28
proton       496
dtype: int64
Additionally, arithmetic and other operations may be performed on a series or a com‐
bination of series. When two series interact, if an element with a particular index
exists in one series and that index does not appear in the other series, the result will
contain all indices. However, the value will be NaN (Not a Number) for the missing
index. This means that the datasets will only grow, and you will not lose an index.
However, the presence of a NaN may not be desired. Reusing the s and t series from
our previous examples, the following code adds these two series together. Since t has
a neutrino element that s does not, the expression s + t will have a neutrino ele‐
ment, but its value will be NaN:
Code
s + t

Returns
electron     48
neutrino    NaN
neutron      72
proton      539
dtype: float64
The advantage of having NaN elements show up in the resulting series is that they
make it very clear that the input series to the operation did not share a common basis.
Sometimes this is OK, like when you do not care about neutrinos. At other times, like
when you want to sum up the total number of counts, the NaN elements are problem‐
atic. This forces you to deal with them and adhere to best practices. There are two
approaches for dealing with a NaN. The first is to go back to the original series and
make sure that they all share a common index. The second is to filter or mask out the
NaN elements after the other operations have completed. In general, it is probably best
to go back to the original series and ensure a common basis for comparison.
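Both approaches are one-liners in pandas. A minimal sketch, reusing the s and t series (filling with zero is just one possible choice):

total = s + t
total.dropna()                        # filter out the NaN rows entirely
total.fillna(0.0)                     # or replace NaN with a chosen value
s.reindex(t.index, fill_value=0) + t  # or fix the inputs to share an index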
Unless otherwise specified, for almost all operations pandas will return a copy of the
data rather than a view, as was discussed in “Slicing and Views” on page 208. This is
to prevent accidental data corruption. However, it comes at the cost of speed and
memory efficiency.
Now that we know how to manipulate series on their own, we can combine many ser‐
ies into a single data frame.
The Data Frame Structure
The DataFrame object can be understood as a collection of series. These series need
not share the same index, though in practice it is useful if they do because then all of
the data will share a common basis. The data frame is a table-like structure akin to a
NumPy structured array or a PyTables Table. The usefulness of data frames is the
same as other table data structures. They make analyzing, visualizing, and storing
complex heterogeneous data easier. Data frames, in particular, provide a lot of helpful
semantics that other table data structures do not necessarily have. Data frames are
distinct from other table-like structures, though, because their columns are series, not
arrays. We can create a data frame from a dictionary of arrays, lists, or series. The
keys of the dictionary become the column names. Reusing the definitions of s and t
from before, we can create a data frame called df:
Code
df = pd.DataFrame({'S': s, 'T': t})

Returns
            S     T
electron   42     6
neutrino  NaN  8128
neutron    44    28
proton     43   496

[4 rows x 2 columns]
You can also create a data frame from a NumPy structured array or a list of tuples.
Data frames may be saved and loaded from CSV files, HDF5 files (via PyTables),
HTML tables, SQL, and a variety of other formats.
Data frames may be sliced, be appended to, or have rows removed from them, much
like other table types. However, data frames also have the sophisticated indexing
semantics that series do. The following code demonstrates some data frame manipu‐
lations:
Code
df[::2]

Returns
           S   T
electron  42   6
neutron   44  28

[2 rows x 2 columns]

Code
dg = df.append(
    pd.DataFrame({'S': [-8128]},
                 index=['antineutrino']))

Returns
                  S     T
electron         42     6
neutrino        NaN  8128
neutron          44    28
proton           43   496
antineutrino  -8128   NaN

[5 rows x 2 columns]

Code
dh = dg.drop('neutron')

Returns
                  S     T
electron         42     6
neutrino        NaN  8128
proton           43   496
antineutrino  -8128   NaN

[4 rows x 2 columns]

Slice every other element.
Add a new index to the data frame and a value to S.
Delete the neutron index.
You may also easily transpose the rows and columns via the T attribute, as seen here:
Code
df.T

Returns
   electron  neutrino  neutron  proton
S        42       NaN       44      43
T         6      8128       28     496

[2 rows x 4 columns]
Arithmetic and other operations may be applied to the whole data frame. The follow‐
ing example creates a Boolean mask data frame by comparing whether the data is less
than 42. Note that the Boolean mask is itself another data frame:
Code
df < 42

Returns
              S      T
electron  False   True
neutrino  False  False
neutron   False   True
proton    False  False

[4 rows x 2 columns]
A major innovation of the data frame is the ability to add and remove columns easily.
With NumPy structured arrays, adding a new column to an existing array involves
creating a new compound dtype to represent the new table, interleaving the new col‐
umn data with the existing table, and copying all of the data into a new structured
array. With data frames, the notion of a column is flexible and interchangeable with
the notion of an index. Data frames are thus much more limber than traditional
tables for representing and manipulating data. Column access and manipulation
occurs via dict-like indexing. Such manipulation can be seen with the existing df
data frame:
Code
# accessing a single column
# will return a series
df['T']

Returns
electron       6
neutrino    8128
neutron       28
proton       496
Name: T, dtype: int64

Code
# setting a name to a series
# or expression will add a
# column to the frame
df['small'] = df['T'] < 100

Returns
            S     T  small
electron   42     6   True
neutrino  NaN  8128  False
neutron    44    28   True
proton     43   496  False

[4 rows x 3 columns]

Code
# deleting a column will
# remove it from the frame
del df['small']

Returns
            S     T
electron   42     6
neutrino  NaN  8128
neutron    44    28
proton     43   496

[4 rows x 2 columns]
These kinds of column operations, along with reindexing, groupby(), missing data
handling, plotting, and a host of other features, make data frames an amazing data
structure. We have only scratched the surface here. While they may not be able to
handle the kinds of data volumes that parallel chunked arrays do, for everyday data
analysis needs nothing beats the flexibility of data frames. As long as your data can
nicely fit into memory, then data frames are a great choice.
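As one last taste of those features, groupby() splits a frame by the values in one column and aggregates each group separately. The particle names and counts below are made up purely for illustration:

counts = pd.DataFrame({'particle': ['proton', 'neutron', 'proton', 'electron'],
                       'count': [4, 1, 2, 7]})
counts.groupby('particle')['count'].sum()
# electron    7
# neutron     1
# proton      6
# Name: count, dtype: int64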
The next section takes us to the other end of the data volume spectrum, with a
detailed discussion of B-trees.
B-Trees
B-trees are one of the most common data structures for searching over big chunks of
data. This makes them very useful for databases. It is not an overstatement to say
that HDF5 itself is largely based on the B-tree. Chunks in a dataset are stored and
managed via B-trees. Furthermore, the hierarchy itself (the collection of groups and
arrays and tables) is also managed through a B-tree.
A B-tree is a tree data structure where all of the nodes are sorted in a breadth-first
manner. Each node may have many subnodes. This structure makes it easy to search
for a specific element because, starting at the top root node, you can simply test
whether the value you are looking for is greater or less than the values in the nodes
that you are at currently. The following example represents a B-tree of the Fibonacci
sequence. Here, the square brackets ([]) represent nodes in the B-tree. The numbers
inside of the square brackets are the values that the B-tree is currently storing:
            .------------[5 89]-------------.
           /               |                 \
     [1 2 3]             [21]          .---[233 1597]---.
                        /    \        /        |         \
                   [8 13]  [34 55]  [144] [377 610 987] [2584 4181]
It is important to note that each node in a B-tree may store many values, and the
number of values in each node may vary. In the simplest case, where each node is
constrained to have a single value, the B-tree becomes a binary search tree (not to be
confused with a binary tree, which we won’t discuss here). The following diagram
shows a binary search tree specialization of a small B-tree:
      [5]
     /   \
   [2]   [8]
  /   \
[1]   [3]
B-trees (and binary search trees) may be rotated. This means that the nodes can be
rearranged to have a different structure without breaking the search and ordering
properties. For example, the tree above may be rotated to the following tree with
equivalent average search properties:
  [2]
 /   \
[1]   [5]
     /   \
   [3]   [8]
B-trees are very effective for nonlinearly organizing array data. The index of the array
determines on which node in the tree the array lives. The tree itself manages the loca‐
tions of all of the nodes. The nodes manage the data chunks assigned to them. The
ability for nodes to be inserted and removed at arbitrary indices allows for arrays to
have missing chunks, be infinitely long, and be extendable.
In practice, B-trees tend to follow some additional simple rules as a way of reaping
performance benefits and making the logic easier to understand:
• The height of the tree, h, is constant. All leaves (terminal nodes) exist at the same
height.
• The root node has height 0.
• The maximum number of child nodes, m, is kept below a constant number across
all nodes.
• Nodes should be split as evenly as possible over the tree in order to be balanced.
The size of a tree is measured by how many nodes (n) it has. Getting, setting, and
deleting nodes in a B-tree are all order log(n) operations on average. The worst-case
behavior for these operations is also order log(n), and it is possible to do better than
this in the unlikely event that the node you are looking for is higher up the tree than
being on a leaf. These properties make B-trees highly desirable from a reliability
standpoint. Table 11-3 shows a comparison between B-tree and hash table perfor‐
mance.
Table 11-3. Performance comparison between B-trees and hash tables for common
operations
Operation            Hash table average  Hash table worst  B-tree average  B-tree worst
Get: x[key]          O(1)                O(n)              O(log n)        O(log n)
Set: x[key] = value  O(1)                O(n)              O(log n)        O(log n)
Delete: del x[key]   O(1)                O(n)              O(log n)        O(log n)
In practice, you will want to use a B-tree whenever you need to quickly find an ele‐
ment in a large, regular, and potentially sparse dataset. If you happen to be writing a
database, B-trees are probably what you want to use to index into the datasets on disk.
B-trees were presented here because they are used frequently under the covers of the
databases that we have already seen, such as HDF5. Furthermore, if you have a map‐
ping that you want to ensure is always sorted by its keys, you will want to use a B-tree
instead of a dictionary. Still, in most other circumstances, a dictionary, NumPy array,
or data frame is typically a better choice than a B-tree.
B-trees as a data structure are too complex to give a full and working example here.
While there is no accepted standard implementation in Python, there are many libra‐
ries that have support for B-trees. You can install and play around with them. A few
that are worth looking into are:
btree
    A dict-like C extension module implementation
BTrees
    A generic B-tree implementation optimized for the Zope Object Database (ZODB)
blist
    A list-, tuple-, set-, and dict-like data structure implemented as a B+-tree (a variant of a strict B-tree)
Let’s take a quick look at the blist package, since it has the best support for Python 2
and 3. This package has a sorteddict type that implements a traditional B-tree data
structure. Creating a sorteddict is similar to creating a Python dictionary. The fol‐
lowing code imports sorteddict, creates a new B-tree with some initial values, and
adds a value after it has been created:
Code:
    from blist import sorteddict
    b = sorteddict(first="Albert",
                   last="Einstein",
                   birthday=[1879, 3, 14])

Returns:
    sorteddict({'birthday': [1879, 3, 14],
                'first': 'Albert',
                'last': 'Einstein'})

Code:
    b['died'] = [1955, 4, 18]
    list(b.keys())

Returns:
    ['birthday', 'died', 'first', 'last']
The keys always appear sorted, because of how B-trees work. Even though the
sorteddict implements a dictionary interface, its performance characteristics and
underlying implementation are very different from those of a hash table.
We have now seen three possible ways to store associations
between keys and values: hash tables, series, and B-trees. Which
one you should use depends on your needs and the properties of
the data structure.
B-trees are great for data storage and for organizing array chunks. For this reason,
they are used in databases quite a bit. The next section presents a different tree struc‐
ture that excels at storing geometry.
K-D Trees
A k-d tree, or k-dimensional tree, is another tree data structure. This one excels at
finding the nearest neighbor for points in a k-dimensional space. This is extraordi‐
narily useful for many physics calculations. Often times, when solving geometric par‐
tial differential equations, the effects that matter the most in the volume at hand come
from the directly surrounding cells. Splitting up the problem geometry into a k-d tree
can make it much faster to find the nearest neighbor cells.
The big idea behind k-d trees is that any point (along with the problem bounds)
defines a k-1 dimensional hyperplane that partitions the remaining space into two
sections. For example, in 1D a point p on a line l will split up the line into the space
above p and the space below p. In two dimensions, a line will split up a box. In three
dimensions, a plane will split a cube, and so on. The points in a k-d tree can then be
placed into a structure similar to a binary search tree. The difference here is that sort‐
ing is based on the point itself along an axis a and that a is equal to the depth level of
the point modulo the number of dimensions, k. Thus, a effectively defines an orienta‐
tion for how points should partition the space.
K-d trees are not often modified once they are instantiated. They typically have get or
query methods but do not have insert or delete methods. The structure of space is
what it is, and if you want to restructure space you should create a whole new tree.
For n points, k-d trees are order log(n) on average for all of their operations. In the
worst case, they are order n.
Furthermore, k-d trees are most effective when k is small. This is ideal for physics cal‐
culations, where k is typically 3 for three spatial dimensions and rarely goes above 6.
For simplicity, the examples here will use k=2. This makes the partitions simple line
segments.
Since we do not need to worry about insertions or deletions, we can represent a sam‐
ple k-d tree algorithm as follows:
class Node(object):
    def __init__(self, point, left=None, right=None):
        self.point = point
        self.left = left
        self.right = right

    def __repr__(self):
        isleaf = self.left is None and self.right is None
        s = repr(self.point)
        if not isleaf:
            s = "[" + s + ":"
        if self.left is not None:
            s += "\n left = " + "\n ".join(repr(self.left).split('\n'))
        if self.right is not None:
            s += "\n right = " + "\n ".join(repr(self.right).split('\n'))
        if not isleaf:
            s += "\n ]"
        return s

def kdtree(points, depth=0):
    if len(points) == 0:
        return None
    k = len(points[0])
    a = depth % k
    points = sorted(points, key=lambda x: x[a])
    i = int(len(points) / 2)  # middle index, rounded down
    node_left = kdtree(points[:i], depth + 1)
    node_right = kdtree(points[i+1:], depth + 1)
    node = Node(points[i], node_left, node_right)
    return node
A tree consists of nodes.
A node is defined by its point. Since this is a binary search tree, it may have one
node to the left and one node to the right.
A string representation of the node given its relative location in the tree.
A recursive function that returns the root node given a list of points. This will
automatically be balanced.
As you can see, the Node class is very simple. It holds a point, a left child node, and a
right child node. The heavy lifting is done by the kdtree() function, which takes a
sequence of points and recursively sets up the nodes. Each node having only two chil‐
dren makes k-d trees much easier to manipulate than B-trees. As an example, con‐
sider the following random sequence of points internal to the range [0, 6] in 2-space.
These can be placed into a k-d tree using code like the following:
Code:
    points = [(1, 2), (3, 2),
              (5, 5), (2, 1),
              (4, 3), (1, 5)]
    root = kdtree(points)
    print(root)

Returns:
    [(3, 2):
     left = [(1, 2):
      left = (2, 1)
      right = (1, 5)
      ]
     right = [(5, 5):
      left = (4, 3)
      ]
     ]
The partitioning generated by this k-d tree is visualized in Figure 11-2.
For a rigorously tested, out-of-the-box k-d tree, you should use the KDTree class
found in scipy.spatial. This is a NumPy-based formulation that has more impres‐
sive querying utilities than simply setting up the tree. The implementation of KDTree
also falls back to a brute-force search once the search space becomes small enough, as
set by a user-definable parameter. Using the points list from our previous example, the following
creates an instance of the KDTree class:
from scipy.spatial import KDTree
tree = KDTree(points)
Figure 11-2. K-d tree partitioning example
This tree object has a data attribute that is a NumPy array representing the points. If
the points were originally a NumPy array the data attribute would be a view, not a
copy. The following shows that the tree’s data retains the order of the original points:
Code:
    tree.data

Returns:
    array([[1, 2],
           [3, 2],
           [5, 5],
           [2, 1],
           [4, 3],
           [1, 5]])
The query() method on the KDTree class takes a sequence of points anywhere in
space and returns information on the N nearest points. It returns an array of distances
to these points as well as the indices into the data array of the points themselves. Note
that query() does not return the cell in which a point lives. Using the tree we con‐
structed previously again, let’s find the nearest point in the tree to the location (4.5,
1.25):
Code:
    # query() defaults to only the closest point
    dist, idx = tree.query([(4.5, 1.25)])
    # results
    dist

Returns:
    array([ 1.67705098])

Code:
    idx

Returns:
    array([1])

Code:
    # fancy index by idx to get the point
    tree.data[idx]

Returns:
    array([[3, 2]])
The result of this query may be seen graphically in Figure 11-3.
Figure 11-3. K-d tree nearest neighbor query, distance shown as dashed line
These querying capabilities are very useful whenever you have data for some points
in space and want to compute the corresponding values for any other point based on
the data that you have. For example, say each of your points has a corresponding
measurement for magnitude of the electric field at that location. You could use a k-d
tree to determine the nearest neighbor to any point in space. This then lets you
approximate the electric field at your new point based on the distance from the clos‐
est measured value. In fact, this strategy may be used for any scalar or vector field.
The underlying physics may change, but finding the nearest neighbors via a k-d tree
does not.
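To make that concrete, here is a minimal sketch built on the KDTree from the previous example. The field values are hypothetical measurements (one per point in the tree), and the field_at() helper is our own illustration rather than part of SciPy:

    import numpy as np
    from scipy.spatial import KDTree

    points = [(1, 2), (3, 2), (5, 5), (2, 1), (4, 3), (1, 5)]
    e_field = np.array([0.5, 1.2, 0.8, 2.0, 1.5, 0.3])  # hypothetical |E| at each point

    tree = KDTree(points)

    def field_at(locations):
        """Approximate the field anywhere by the value at the nearest measured point."""
        dist, idx = tree.query(locations)
        return e_field[idx]

    print(field_at([(4.5, 1.25)]))  # uses the value measured at (3, 2)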
Due to their broad applicability, k-d trees come up over and over again in computa‐
tional physics. For more information on the KDTree class, please refer to the SciPy
documentation.
Data Structures Wrap-up
As a scientist, organizing data is an integral part of daily life. However, there are many
ways to represent the data you have. Familiarity with a variety of strategies gives you
more flexibility to choose the option that best fits any given task. What you have
learned here are some of the most significant to physics-based fields. Of these, hash
tables are useful to just about everyone (and are not limited to physics). K-d trees, on
the other hand, are most useful when you’re thinking about problems spatially. Physi‐
cists do so a lot more than other folks. You should now be familiar with the following
concepts:
• Hash tables require a good hash function, which is provided to you by Python.
• Resizing a hash table can be expensive.
• Hash collisions will happen.
• Series are like NumPy arrays but with more generic indexing capabilities.
• Data frames are like tables with series as columns.
• Data frames handle missing data through NaN values.
• B-trees can be used to organize chunks of an array.
• You may rotate B-trees to change their structure without altering their perfor‐
mance.
• A binary search tree is a B-tree with only one piece of data per node.
• K-d trees are a variant of the binary search tree organized by points in k-dimensional space.
• K-d trees are exceptionally useful for problems involving geometry.
Now that you know how to represent your data, the next chapter will teach you how
to analyze and visualize it.
CHAPTER 12
Performing in Parallel
A natural way to approach parallel computing is to ask the question, “How do I do
many things at once?” However, problems more often arise in the form of, “How do I
solve my one problem faster?” From the standpoint of the computer or operating sys‐
tem, parallelism is about simultaneously performing tasks. From the user’s perspec‐
tive, parallelism is about determining dependencies between different pieces of code
and data so that the code may be executed faster.
Programming in parallel can be fun. It represents a different way of thinking from the
traditional "x then y then z" procedure that we have seen up until now. That said, par‐
allelism typically makes problems faster for the computer to execute, but harder for
you to program. Debugging, opening files, and even printing to the screen all become
more difficult to reason about in the face of P processors. Parallel computing has its
own set of rewards, challenges, and terminology. These are important to understand,
because you must be more like a mechanic than a driver when programming in
parallel.
Physicists often approach parallelism only when their problems finally demand it.
They also tend to push back that need as far as possible. It is not yet easier to program
in parallel than it is to program procedurally. Here are some typical reasons that par‐
allel solutions are implemented:
• The problem creates or requires too much data for a normal machine.
• The sun would explode before the computation would finish.
• The algorithm is easy to parallelize.
• The physics itself cannot be simulated at all with smaller resources.
This chapter will focus on how to write parallel programs and make effective use of
your computing resources. We will not be focusing on the underlying computer
science that enables parallelism, although computer science topics in parallel comput‐
ing are fascinating. You do not need to know how to build a bicycle to be able to ride
one. By analogy, you can create computational physics programs without knowing
how to build your own supercomputer. Here, we opt to cover what you need to know
to be effective in your research. Let’s start with some terminology.
Scale and Scalability
While parallelism is often necessary for large problems, it is also useful for smaller
problems. This notion of size is referred to as scale. Computational scale is a some‐
what ambiguous term that can be measured in a few different ways: the number of
processes, the computation rate, and the data size.
A simple definition for scale is that it is proportional to the number of P processes
that are used. This provides the maximum degree of parallelism that is possible from
the point of view of the computer.
It is better to talk about the number of processes used rather than
the number of processors. This is because the number of processes
is independent of the hardware.
Another popular measure of parallelism is the number of floating-point operations
per second, or FLOPS. This is the rate at which arithmetic operations (addition, mul‐
tiplication, modulus, etc.) happen on float data over the whole machine. Since a lot of
scientific code mostly deals with floats, this is a reasonable measure of how fast the
computer can do meaningful computations. FLOPS can sometimes be misleading,
though. It makes no claims about integer operations, which are typically faster and
important for number theory and cryptography. It also makes no claims about string
operations, which are typically slower and important for genomics and biology. The
use of graphics processing units (GPUs) is also a way to game the FLOPS system.
GPUs are designed to pump through large quantities of floating-point operations, but
they need special programming environments and are not great in high-data situa‐
tions. FLOPS is a good measure of how fast a computer can work in an ideal situa‐
tion, but bear in mind that obtaining that ideal can be tricky.
The final measure of scale that is commonly used is how much data is generated as a
result of the computation. This is easy to measure; simply count how many bytes are
written out. This tells you nothing of how long it took to generate those bytes, but the
data size is still important for computer architects because it gives a scale of how
much RAM and storage space will be required.
There are two important points to consider given the different scale metrics:
1. A metric tends to be proportional to the other metrics—a machine that is large
on one scale is often large on the others.
2. Achieving a certain computation scale is preceded by having achieved lower
scales first.
For example, point 1 states that when you have more processes available, you are
capable of more FLOPS. Point 2 says that you cannot store a petabyte without storing
a terabyte, you cannot store a terabyte without storing a gigabyte, and so on.
Together, these points imply that you should try to scale up your code slowly and
methodically. Start with the trivial case of trying to run your code on one process.
Then attempt 10 processes, and then 100, and up. It may seem obvious, but do not
write code that is supposed to work on a million cores right out of the gate. You can’t,
and it won’t. There will be bugs, and they will be hard to track down and resolve.
Slowly scaling up allows you to address these issues one by one as they emerge, and
on the scale at which they first appear.
Scalability is an indication of how easy or hard it is to go up in scale. There are many
ways of measuring scalability, but the one that applies the most to computing is run‐
time performance. For a given machine, scalability is often measured in strong and
weak forms.
Strong scaling is defined as how the runtime changes as a function of the number of
processors for a fixed total problem size. Typically this is measured by the speedup, s.
This is the ratio of time it takes to execute on one processor, t1, to the time it takes to
execute on P processors, tP:
    s_P = \frac{t_1}{t_P}
In a perfectly efficient system, the strong scaling speedup is linear. Doubling the
number of processors will cause the problem to run in half the time.
Weak scaling, on the other hand, is defined as how the runtime changes as a function
of the number of processors for a fixed problem size per processor. This is typically
measured in what is known as the sizeup, z. For a problem of size N_P on P processors
(and N_1 on a single processor), the sizeup is defined as:

    z_P = \frac{t_1}{t_P} \times \frac{N_P}{N_1}
In a perfectly weak scaling system, the sizeup is linear. Doubling the number of pro‐
cessors will double the size of the problem that may be solved.
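Both measures are straightforward to compute once you have wall-clock timings in hand. The following is a minimal sketch with made-up runtimes and problem sizes; the function names are ours, not from any library:

    def speedup(t1, tP):
        """Strong scaling: same total problem size, run on P processors."""
        return t1 / tP

    def sizeup(t1, tP, N1, NP):
        """Weak scaling: problem size NP on P processors versus N1 on one."""
        return (t1 / tP) * (NP / N1)

    print(speedup(100.0, 30.0))            # about 3.3x on, say, 4 processors
    print(sizeup(100.0, 110.0, 1e6, 4e6))  # about 3.6x for a 4x larger problem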
Both strong and weak scaling are constrained by Amdahl’s law. This law follows from
the observation that some fraction of an algorithm–call it α–cannot be parallelized.
Thus, the maximum speedup or sizeup possible for P processors is given as:
    \max s_P = \frac{1}{\alpha + \frac{1 - \alpha}{P}}
Taking this to the limit of an infinite number of processors, the maximum possible
speedup is given as:
    \max s = \lim_{P \to \infty} \frac{1}{\alpha + \frac{1 - \alpha}{P}} = \frac{1}{\alpha}
Thus, Amdahl’s law states that if, for example, 10% of your program is unparalleliza‐
ble (α = 0.1), then the best possible speedup you could ever achieve is a factor of 10.
In practice, it is sometimes difficult to know what α is in more sophisticated algo‐
rithms. Additionally, α is usually much smaller than 0.1, so the speedups achievable
are much greater. However, Amdahl’s law is important because it points out that there
is a limit.
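Since the formula is so simple, it is easy to get a feel for it numerically. Here is a minimal sketch (the function name is ours) that evaluates the bound for a serial fraction alpha on P processors:

    def amdahl_speedup(alpha, P):
        """Upper bound on speedup when a fraction alpha of the work is serial."""
        return 1.0 / (alpha + (1.0 - alpha) / P)

    print(amdahl_speedup(0.1, 8))    # about 4.7x on 8 processors
    print(amdahl_speedup(0.1, 1e9))  # approaches the 1/alpha = 10x limit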
What matters more than problem scale, though, is the algorithm. The next section
discusses the types of parallel problems that exist.
Problem Classification
Certain algorithms lend themselves naturally to parallelism, while others do not.
Consider summing a large array of numbers. Any part of this array may be summed
up independently from any other part. The partial sums can then be summed
together themselves and achieve the same result as if the array had been summed in
series. Whether or not the partial sums are computed on the same processor or at the
same time does not matter. Algorithms like this one with a high degree of independ‐
ence are known as embarrassingly parallel problems. Other embarrassingly parallel
examples include Monte Carlo simulations, rendering the Mandelbrot set, and certain
optimization techniques such as stochastic and genetic algorithms. For embarrass‐
ingly parallel problems, the more processors that you throw at them, the faster they
will run.
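As an illustration of the partial-sum idea, the following minimal sketch splits an array into independent chunks that could each be summed on its own processor; here the pieces are summed serially just to show that the combined result matches (the data is made up):

    import numpy as np

    data = np.arange(1000000)                     # integer data, so the sums are exact
    chunks = np.array_split(data, 4)              # four independent pieces of work
    partials = [chunk.sum() for chunk in chunks]  # each could run on its own processor
    total = sum(partials)                         # combine the partial results
    print(total == data.sum())                    # True: same answer as the serial sum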
On the other hand, many problems do not fall into this category. This is often
because there is an unavoidable bottleneck in the algorithm. A classic operation that
is difficult to parallelize is inverting a matrix. In general, every element of the inverse
matrix is dependent on every element in the original matrix. This makes it more diffi‐
cult to write efficient parallel code, because every process must know about the entire
original matrix. Furthermore, many operations are the same or similar between
elements.
All hope is not lost for non-embarrassingly parallel problems. In most cases there are
mathematical transformations you can make to temporarily decrease the degree of
dependence in part of the problem. These transformations often come with the cost
of a little excess computing elsewhere. On the whole though, the problem solution
will have faster runtimes due to increased parallelism. In other cases, the more you
know about your data the more parallelism you can eke out. Returning to the inverse
matrix example, if you know the matrix is sparse or block diagonal there are more
parallelizable algorithms that can be used instead of a fully generic solver. It is not
worth your time or the computer’s time to multiply a bunch of zeros together.
At large scales, these classifications have names based on the architecture of the
machines that run them rather than the properties of the algorithm. The better
known of these is high-performance computing (HPC). Such machines are built to run
non-embarrassingly parallel problems. When people talk about supercomputers, they
are referring to HPC systems. As a rule, all of the nodes on an HPC system are the
same. Each node has the same number of CPUs and GPUs and the same amount of
memory, and each processor runs the same operating system. Though the nodes are
connected together in a predetermined topology (rings, toruses, etc.), node homoge‐
nization gives the illusion that any subset of nodes will act the same as any other sub‐
set of nodes of the same size.
You should consider using HPC systems when your problems are large and not
embarrassingly parallel, or they are large and you need the node homogeneity. That
said, RAM per node has been on a downward trend in HPC, so if you have a high-memory application, you have extra work to do in making the algorithm data parallel
in addition to compute parallel. Because they are a little trickier to program, but not
difficult conceptually, we will largely skip discussing data-parallel algorithms here.
HPC’s lesser-known sibling is called high-throughput computing (HTC). As its name
implies, HTC is designed to pump through as many operations as possible with little
to no communication between them. This makes HTC ideally suited to embarrass‐
ingly parallel problems. Nodes in an HTC system need not be the same. In some
cases, they do not even have to share the same operating system. It is incorrect to
think of an HTC system as a single machine. Instead, it is a coordinated network of
machines. When these machines are spread over a continent or the entire world,
HTC is sometimes known as distributed computing.
In both HPC and HTC systems, the most expensive component of a parallel algo‐
rithm is communication. The more frequently you have to communicate between
processes, and the more data that must be sent, the longer the calculation takes. This
almost invariably means that one node will be waiting while another node finishes. In
HPC, you can minimize communication time by trying to place communicating
nodes topologically close to each other. In HTC systems, there are only two phases
with communication—task initialization and return. For short tasks, though, this
communication time can dominate the execution time. So, in HTC it is always a good
idea to perform as much work as possible on a node before returning.
Now that you have an understanding of the problem space, you can learn how to
solve actual problems in parallel.
Example: N-Body Problem
Throughout the rest of this chapter we will be exploring parallelism through the lens
of a real-world problem. Namely, we will be looking at the N-body problem. This is a
great problem for computational physics because there are no analytical solutions
except when N=2, or in certain cases when N=3. Numerical approximations are the
best that we can get. Furthermore, as we will see in the following sections, a major
portion of this problem lends itself nicely to parallelism while another portion does
not.
The N-body problem is a generalization of the classic 2-body problem that governs
the equations of motion for two masses. As put forward by Newton’s law of gravity,
we see:
    \frac{dp}{dt} = G \frac{m_1 m_2}{r^2}

where p is momentum, t is time, G is the gravitational constant, m_1 and m_2 are the
masses, and r is the distance between them. In most cases, constant mass is a reasonable
assumption. Thus, the force on the ith mass from the N-1 other bodies is as follows:

    m_i \frac{d^2 \mathbf{x}_i}{dt^2} = \sum_{j=1, j \ne i}^{N} G \frac{m_i m_j (\mathbf{x}_j - \mathbf{x}_i)}{\lVert \mathbf{x}_j - \mathbf{x}_i \rVert^3}

with \mathbf{x}_i being the position of the ith body. Rearranging the masses, we can compute
the acceleration \mathbf{a}_i:

    \mathbf{a}_i = G \sum_{j=1, j \ne i}^{N} \frac{m_j (\mathbf{x}_j - \mathbf{x}_i)}{\lVert \mathbf{x}_j - \mathbf{x}_i \rVert^3}
This term for the acceleration can then be plugged into the standard time-discretized
equations of motion:
    \mathbf{v}_{i,s} = \mathbf{a}_{i,s-1} \Delta t + \mathbf{v}_{i,s-1}
    \mathbf{x}_{i,s} = \mathbf{a}_{i,s-1} \Delta t^2 + \mathbf{v}_{i,s-1} \Delta t + \mathbf{x}_{i,s-1}

where \mathbf{v} is the velocity, s indexes time, and \Delta t = t_s - t_{s-1} is the length of the time step.
Given the initial conditions \mathbf{x}_{i,0} and \mathbf{v}_{i,0} for all N bodies and a time step \Delta t, these
equations will numerically solve for the future positions, velocities, and accelerations.
Now that we have a statement of the N-body problem, the following sections solve
this problem in their own illuminating ways.
No Parallelism
When you’re implementing a parallel algorithm, it is almost always easier to imple‐
ment and analyze a sequential version first. This is because you can study, debug, and
get a general sense of the behavior of the algorithm without adding an extra variable:
the number of processors. This version is also almost always easier to program and
requires fewer lines of code. A common and reliable strategy is to write once in series
and then rewrite in parallel. Related to this is the somewhat subtle point that a
sequential algorithm is not the same as a parallel algorithm with P=1. Though the two
both run on one processor, they may have very different performance characteristics.
The following collection of functions implements a non-parallel solution to a single
time step of the N-body problem, as well as generating some initial conditions:
import numpy as np

def remove_i(x, i):
    """Drops the ith element of an array."""
    shape = (x.shape[0]-1,) + x.shape[1:]
    y = np.empty(shape, dtype=float)
    y[:i] = x[:i]
    y[i:] = x[i+1:]
    return y

def a(i, x, G, m):
    """The acceleration of the ith mass."""
    x_i = x[i]
    x_j = remove_i(x, i)
    m_j = remove_i(m, i)
    diff = x_j - x_i
    mag3 = np.sum(diff**2, axis=1)**1.5
    result = G * np.sum(diff * (m_j / mag3)[:,np.newaxis], axis=0)
    return result

def timestep(x0, v0, G, m, dt):
    """Computes the next position and velocity for all masses given
    initial conditions and a time step size.
    """
    N = len(x0)
    x1 = np.empty(x0.shape, dtype=float)
    v1 = np.empty(v0.shape, dtype=float)
    for i in range(N):
        a_i0 = a(i, x0, G, m)
        v1[i] = a_i0 * dt + v0[i]
        x1[i] = a_i0 * dt**2 + v0[i] * dt + x0[i]
    return x1, v1

def initial_cond(N, D):
    """Generates initial conditions for N unity masses at rest
    starting at random positions in D-dimensional space.
    """
    x0 = np.random.rand(N, D)
    v0 = np.zeros((N, D), dtype=float)
    m = np.ones(N, dtype=float)
    return x0, v0, m
We should not compute the acceleration from this mass to itself.
Compute the acceleration on the ith mass.
Update the locations for all masses for each time step.
Random initial conditions are fine for our sample simulator. In a real simulator,
you would be able to specify initial particle positions and velocities by hand.
Here, the function a() solves for the acceleration of the ith mass, timestep() advan‐
ces the positions and velocities of all of the masses, remove_i() is a simple helper
function, and initial_cond() creates the initial conditions for randomly placed unit
masses at rest. The masses are all placed within the unit cube. All of these functions
are parameterized for N masses in D dimensions (usually 2 or 3). The masses here are
treated as points and do not collide by hitting one another.
In most initial configurations, the masses will start accelerating toward one another
and thereby gain velocity. After this, their momentums are typically not aligned cor‐
rectly for the masses to orbit one another. They leave the unit cube, flying off in what‐
ever direction they were last pointed in.
Since this is for demonstration purposes, we will set the gravitational constant G=1. A
reasonable time step in this case is dt=1e-3. An example of the initial conditions and
the first time step may be seen in Figures 12-1 and 12-2.
Figure 12-1. Example N-body positions for bodies at rest (initial conditions)
The data for these figures is generated by the following code:
x0, v0, m = initial_cond(10, 2)
x1, v1 = timestep(x0, v0, 1.0, m, 1.0e-3)
More generally, we can also write a simple driver function that simulates S time steps.
The simulate() function takes the positions and velocities and updates them for
each time step:
def simulate(N, D, S, G, dt):
    x0, v0, m = initial_cond(N, D)
    for s in range(S):
        x1, v1 = timestep(x0, v0, G, m, dt)
        x0, v0 = x1, v1
Figure 12-2. Example N-body positions and velocities after one time step
We can measure the performance of this non-parallel algorithm by calling
simulate() for a variety of different N values and timing how long it takes. The following
snippet shows how to automate these timings using Python’s built-in time module.
Here, we scale the particles by increasing powers of two:
import time

Ns = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
runtimes = []
for N in Ns:
    start = time.time()
    simulate(N, 3, 300, 1.0, 1e-3)
    stop = time.time()
    runtimes.append(stop - start)
Here, 300 time steps were chosen because after this point almost all of the bodies have
left the original unit cube. The bodies are unlikely to engage in further interactions.
Three hundred time steps is also large enough that the overhead that comes from just
starting and running the simulations is minimized.
Intuition tells us that the N-body problem is order-N². This is because for every new
body added, another iteration is added in the outer loop (the timestep() function)
and another iteration is added in the inner loop (the a() function). Thus, we would
expect the runtimes to be roughly parabolic as a function of N. Call the runtime of an
N-body simulation t_N. Figure 12-3 shows the ratio between the runtime t_N and t_2,
the runtime of the 2-body problem.
Figure 12-3. Relative runtimes of a non-parallel N-body simulation
Figure 12-3 is more or less parabolic, especially for large values of N. Since this simu‐
lation is done in serial, it is a little excessive to go much beyond 8,192 bodies. More
telling, though, is the relative doubling time. This is defined as t_N divided by the runtime
for half as many bodies, t_{N/2}. Since the problem is order-N², if you double the
number of particles the runtime should go up by a factor of 4 no matter how many
particles there are in the simulation. Figure 12-4 displays this quantity.
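The doubling ratios plotted in Figure 12-4 can be computed directly from the runtimes gathered above; a minimal sketch, reusing the Ns and runtimes lists from the timing snippet:

    # ratios[i] compares the runtime for Ns[i+1] bodies to that for Ns[i] bodies
    ratios = [t_2n / t_n for t_n, t_2n in zip(runtimes[:-1], runtimes[1:])]
    for N, r in zip(Ns[1:], ratios):
        print(N, r)  # approaches ~4 for large N in an order-N**2 algorithm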
Figure 12-4. Relative doubling runtimes of a non-parallel N-body simulation
From this figure, there seem to be two distinct regimes in the relative doubling time.
For large N, the relative doubling time does indeed seem to approach the magic 4
value. However for small N, this ratio is much closer to 2. In some cases it is even
below 2, which implies that adding bodies decreases the amount of time spent on
each body.
That Figure 12-4 is not a flat line close to 4 means that for low N, the overhead of the
simulation dominates. Python, numpy, copying memory, and other facets of the
implementation take up more of the runtime than the actual computation of the posi‐
tions, velocities, and accelerations. It is not until about 2,000 bodies that the effects of
N are really pertinent. Thus, for small N it probably is not worth parallelizing at all.
The return on investment would be too low. For larger N, though, it will make a world
of difference.
Let’s now take a look at our first parallel N-body solver.
Threads
Threads are most people’s first thought when they approach parallelism. However,
threading in Python is almost certainly the wrong tool for the job. We cover thread‐
ing here because it introduces some basic parallel concepts. You should also know
why you should not use it.
You should not use threads in Python for most scientific tasks! You
should read this section anyway, though, because threading can be
useful outside of Python.
Threads are objects that perform work and are not blocked by the work that other
threads may be doing simultaneously. Executing code in a thread does not stop
another thread from executing its own code. Threads may communicate with each
other through their state. Threads may spawn child threads, and there is no limit to
how many threads a program may run. There is always at least one thread in Python
—the main thread.
Threads are one-time use. They have a run() method; once it completes, the thread
is no longer considered alive.
instead called implicitly when the start() method is called. This lets threads defer
their work until a later point in time when the work has been properly set up.
Threads cannot be killed externally; instead, the run() method must be allowed to
complete. The exception to this rule is that if the thread is a daemon, the thread will
die when the Python program itself exits.
Threads can be found in the threading module in the Python standard library. This
exposes an interface that is similar to the threading interfaces found in other lan‐
guages, such as Java or C.
Threading’s big gotcha is that all threads execute in the same process as Python itself.
This is because Python is an interpreted language, and the Python interpreter itself
lives in only one process. This means that even though the threads may not block
each other’s execution, their combined speed is limited to that of a single processor
on the machine. Number crunching for the most part is CPU-bound—that is, it is
limited by the speed of the processor. If the computer is already running at top speed,
adding more things for it to do does not make it go any faster. In fact, adding more
tasks often hurts execution time. The processor becomes crowded.
The one-processor limit to threads is due to a detail of the way that the standard
Python interpreter (called CPython, because it is written in C) is written. This detail is
called the global interpreter lock, or GIL. The GIL determines which thread is cur‐
rently running and when to switch from executing one thread to executing another.
Several attempts to have Python without the GIL have been proposed over time. Most
notable is the PyPy project. None of these, however, have usurped the popularity of
the classic CPython interpreter. CPython remains especially dominant in the scien‐
tific computing world.
So why do people use threads? For high-latency tasks, like reading a file from disk or
downloading information over a network, most of the time the program is just wait‐
ing to get back the next chunk of data before it can proceed. Threads work well in
these cases because the spare time can be put to good use executing other code.
However, in CPU-bound tasks, the processor is already as busy as it can be, and
threads cause more problems than they solve. Except for reading in the occasional
large dataset, scientific code is overwhelmingly CPU-bound.
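For completeness, here is a minimal sketch of the kind of I/O-bound situation where threads do pay off. The time.sleep() call stands in for network or disk latency, and the fetch() function and its timing are purely illustrative:

    import time
    from threading import Thread

    def fetch(i):
        time.sleep(1.0)            # pretend to wait on a slow network response
        print("got chunk", i)

    threads = [Thread(target=fetch, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                   # all four "downloads" finish in about 1 second, not 4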
Even though you probably should not use them, let’s take a look at how the N-body
problem is solved with threads. The first and easiest tactic for parallelization is to
identify the parts of the code that do not depend on one another. With complete
independence, the processes do not have to communicate with one another, so paral‐
lelism should come easy. In some algorithms this independence manifests as an inner
loop and in some it is an outer loop. In some algorithms you can even switch which is
the inner and which is the outer loop, to better fit your needs.
In the N-body problem, the loop in the timestep() function is the most easily paral‐
lelized. This allows us to only modify this one function. We can leave the implemen‐
tations of the remove_i(), a(), and initial_cond() functions the same as they were
in “No Parallelism” on page 285.
Before we can modify the timestep() function, we need to define threads that
instead do the work of a single iteration. The following code implements a thread for
use with the N-body problem:
from threading import Thread

class Worker(Thread):
    """Computes x, v, and a of the ith body."""
    def __init__(self, *args, **kwargs):
        super(Worker, self).__init__(*args, **kwargs)
        self.inputs = []
        self.results = []
        self.running = True
        self.daemon = True
        self.start()

    def run(self):
        while self.running:
            if len(self.inputs) == 0:
                continue
            i, x0, v0, G, m, dt = self.inputs.pop(0)
            a_i0 = a(i, x0, G, m)
            v_i1 = a_i0 * dt + v0[i]
            x_i1 = a_i0 * dt**2 + v0[i] * dt + x0[i]
            result = (i, x_i1, v_i1)
            self.results.append(result)
To use threading, we need to subclass Thread.
inputs is a buffer of work for this thread to churn through. We cannot call the
run() method directly, so inputs provides a way to send data to the thread.
A buffer of return values. Again, since we do not call the run() method our‐
selves, we have to have a place to store our results.
Setting this to False will cause the run() method to end safely.
Allow the thread to be terminated when Python exits.
Check if there is work to do.
The body of the original timestep() function.
First, we grab the Thread class from the threading module that comes with Python.
To make this useful to the N-body problem, we subclass it under the name Worker
and then override the constructor and the run() methods.
The constructor has two major purposes: setting up data structures used for commu‐
nication and starting the thread. The inputs and results lists are used to load work
onto the thread and get corresponding return values. The running flag is used to end
the life of the thread cleanly—it will not stop midway while computing a result. The
daemon flag is used to tell Python that this thread may be stopped when the program
exits. Finally, the call to the start() method at the end of the constructor immedi‐
ately makes the thread available for work when it is created.
While the running flag is true, the run() method will sit around waiting for work.
Work comes in the form of a nonempty inputs list. The elements of this list are
tuples that contain all of the arguments to a() as well as the time step value dt. When
work arrives, the run() method proceeds to compute the next values of acceleration,
velocity, and position for the ith body. The run() method then stores i, the next posi‐
tion, and the next velocity as a tuple in the results list, so that they can be read back
by whoever put the arguments in the inputs list. It is important to keep i in the
results list because it creates an unambiguous way of later determining which input
goes with which result.
Managing individual one-time-use threads can be painful. Creating and starting
threads can be expensive. It is best if thread objects can be reused for as long as possi‐
ble. To handle this, a pattern known as thread pools is extremely common. The idea
here is that the pool is created with some fixed size, and it is then responsible for cre‐
ating, starting, farming out work to, stopping, and finally deleting the threads it man‐
ages. Python does not provide a thread pool to use off the shelf. However, they are
fairly easy to implement. The following is a pool that manages instances of the Worker
class:
class Pool(object):
    """A collection of P worker threads that distributes tasks
    evenly across them.
    """
    def __init__(self, size):
        self.size = size
        self.workers = [Worker() for p in range(size)]

    def do(self, tasks):
        for p in range(self.size):
            self.workers[p].inputs += tasks[p::self.size]
        while any([len(worker.inputs) != 0 for worker in self.workers]):
            pass
        results = []
        for worker in self.workers:
            results += worker.results
            worker.results.clear()
        return results

    def __del__(self):
        for worker in self.workers:
            worker.running = False
Create new workers according to the size.
Evenly distribute tasks across workers with slicing.
Wait for all of the workers to finish all of their tasks.
Get back the results from the workers and clean out the workers.
Return the complete list of results for all inputs.
Stop the workers when the pool itself is shut down.
Instantiating a Pool creates as many workers as were requested and keeps references
to them for future use. The do() method takes a list of tasks—all of inputs that we
want to run—and appends them to the workers’ inputs lists. The tasks are appended
using slicing to spread them out among all workers as evenly as possible. The pool
then waits until the workers do not have anything left in their inputs lists. Once all of
the work has been completed, the results are aggregated together, the workers’
results lists are cleared for future work, and all of the results together are returned.
Note that the results are not guaranteed to be in the order of the tasks that spawned
them. In most cases, the order will be unpredictable. Lastly, the pool is responsible for
stopping the workers from running once the pool itself is shut down, via its
__del__() method.
Using the Worker and Pool classes, our timestep() function can be rewritten to now
accept a pool object to use for its computation. Note that since the results come back
in an effectively random order, we must reconstruct them according to the i value.
Doing so, as well as setting up the tasks, is fairly simple and is very cheap compared
to the actual computation. The following reimplements the timestep() function:
def timestep(x0, v0, G, m, dt, pool):
    """Computes the next position and velocity for all masses given
    initial conditions and a time step size.
    """
    N = len(x0)
    tasks = [(i, x0, v0, G, m, dt) for i in range(N)]
    results = pool.do(tasks)
    x1 = np.empty(x0.shape, dtype=float)
    v1 = np.empty(v0.shape, dtype=float)
    for i, x_i1, v_i1 in results:
        x1[i] = x_i1
        v1[i] = v_i1
    return x1, v1
Create a task for each body.
Run the tasks.
Rearrange the results since they are probably out of order.
Rewriting timestep() necessitates rewriting the simulate() function to create a
Pool. To make scaling easy, simulate() should also be parameterized based on the
number of workers, P. Parallel algorithms need to be told with what degree of paral‐
lelism they should be run. A pool-aware version of simulate() is shown here:
def simulate(P, N, D, S, G, dt):
    x0, v0, m = initial_cond(N, D)
    pool = Pool(P)
    for s in range(S):
        x1, v1 = timestep(x0, v0, G, m, dt, pool)
        x0, v0 = x1, v1
Unlike in “No Parallelism” on page 285, where we investigated how the N-body prob‐
lem behaved as a function of the number of bodies, here we can determine how the
simulation performs as a function of P. The following snippet has a fixed total prob‐
lem size of 64 bodies over 300 time steps:
Ps = [1, 2, 4, 8]
runtimes = []
for P in Ps:
    start = time.time()
    simulate(P, 64, 3, 300, 1.0, 1e-3)
    stop = time.time()
    runtimes.append(stop - start)
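A minimal sketch of how the speedups plotted in Figure 12-5 can be derived from these measurements, relative to the single-worker (P=1) runtime:

    speedups = [runtimes[0] / t for t in runtimes]
    for P, s in zip(Ps, speedups):
        print(P, s)  # values below 1 indicate a slowdown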
In a perfect world, the strong scaling would have a speedup factor of 2x for every time
P is doubled. However, the results here, shown in Figure 12-5, demonstrate that the
speedup is actually a slowdown.
Figure 12-5. Speedup in threaded N-body simulation
Why does this happen? Every time we add threads, we add more load to the CPU.
The single processor then has to spend more time figuring out which thread should
be allowed to execute. As you can see, this gets very ugly very fast. Adding more
threads to CPU-bound tasks does not enable more parallelism; it causes less. Even
when there is only one worker thread, the main thread still exists. This is why the P=1
case, which performs best here, is still around 2.5 times slower than the equivalent
non-parallel case seen in "No Parallelism" on page 285. These excessive slowdowns
are why you should avoid threads.
In the next section, we’ll see a strategy that actually does yield real speedups.
Multiprocessing
Multiprocessing is Python’s way of handing off the responsibility of scheduling paral‐
lel tasks to the operating system. Modern operating systems are really good at multi‐
tasking. Multitasking allows for the sum of all processes to vastly exceed the resource
limits of the computer. This is possible because multitasking does not necessarily
allow for all processes to be simultaneously active. Any time you are browsing the
Web, playing music, storing data on a flash drive, and running your fancy new USB
toaster at the same time, you and your computer are multitasking.
The capability for multitasking is central to being able to use a computer for common
tasks. On a single-CPU system, only one process can be physically executing at a
time. A multitasking OS may pause this process and switch contexts to another pro‐
cess. Frequent, well-timed context switching gives the illusion of several processes
running at once. When there are many processors available, multitasking can easily
distribute a process across them, granting a degree of parallelism for free.
Forgetting filesystems, device drivers, and other human interactions, it is common
for people to now view the main job of the operating system as scheduling tasks. At
this point, most operating systems have extremely well-implemented schedulers.
From Python’s perspective, why not let someone else take care of this hard problem?
It is, after all, a solved hard problem. For CPU-intensive tasks, multitasking is exactly
what you should do.
What is called “threading” in other languages is actually imple‐
mented more similarly to multiprocessing in Python.
Creating or spawning a new OS-level process is handled by Python. However, note
that on POSIX (Unix-like—i.e., Linux and Mac OS X) machines, all processes are
forks of the processes that spawned them. A fork inherits the environment of its par‐
ent. However, any modifications to the environment that the fork makes will not be
reflected back in the parent process. Forks may spawn their own forks, which will get
the modified environment. Killing a parent process will kill all of its forks. So, when
Python spawns more processes, they all see the environment that the main Python
process had. Multiprocessing interacts more directly with the OS, so other OS-level
concepts that were seen in Chapter 1, such as piping, sometimes make an appearance
here too.
Multiprocessing in Python is implemented via the standard library multiprocessing
module. This provides a threading-like interface to handling processes. There are
two major distinctions, though:
1. Multiprocessing cannot be used directly from an interactive interpreter. The
main module (__main__) must be importable by the forks.
2. The multiprocessing module provides a Pool class for us. We do not need to
write our own.
The Pool class has a number of different methods that implement parallelism in
slightly different ways. However, the map() method works extraordinarily well for
almost all problems.
Pool.map() has a similar interface to the built-in map() function. It takes two argu‐
ments—a function and an iterable of arguments to pass into that function—and
returns a list of values in the same order that was given in the original iterable.
Pool.map() blocks further execution of Python code until all of the results are ready.
The major limitation of Pool.map() is that the function that it executes must have
only one argument. You can overcome this easily by storing the arguments you need
in a tuple or dictionary before calling it.
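Here is a minimal, self-contained sketch of Pool.map() outside the N-body problem; it assumes the code lives in an importable script rather than being typed into the interactive interpreter:

    from multiprocessing import Pool

    def square(x):
        return x * x

    if __name__ == '__main__':
        pool = Pool(2)                     # two worker processes
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
        pool.close()
        pool.join()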
For the N-body problem, we no longer need the Worker and Pool classes that we used
in “Threads” on page 290. Instead, we simply need a timestep_i() function that
computes the time step evolution of the ith body. To make it easier to use with multi
processing, timestep_i() should only have one argument. The following code
defines this function:
from multiprocessing import Pool

def timestep_i(args):
    """Computes the next position and velocity for the ith mass."""
    i, x0, v0, G, m, dt = args
    a_i0 = a(i, x0, G, m)
    v_i1 = a_i0 * dt + v0[i]
    x_i1 = a_i0 * dt**2 + v0[i] * dt + x0[i]
    return i, x_i1, v_i1
Unpack the arguments to the original timestep() function.
The body of the original timestep() function.
Note that the first operation here is to unpack args into the variables we have come
to know for this problem. Furthermore, the actual timestep() function from
“Threads” on page 290 must be altered slightly to account for the new kind of process
pool and the timestep_i() function. Mostly, we need to swap out the old do() call
for the new Pool.map() call and pass it timestep_i, as well as the tasks:
def timestep(x0, v0, G, m, dt, pool):
    """Computes the next position and velocity for all masses given
    initial conditions and a time step size.
    """
    N = len(x0)
    tasks = [(i, x0, v0, G, m, dt) for i in range(N)]
    results = pool.map(timestep_i, tasks)
    x1 = np.empty(x0.shape, dtype=float)
    v1 = np.empty(v0.shape, dtype=float)
    for i, x_i1, v_i1 in results:
        x1[i] = x_i1
        v1[i] = v_i1
    return x1, v1
Replace the old do() method with multiprocessing’s Pool.map().
All of the other functions from “Threads” on page 290 remain the same, including
simulate().
So how does multiprocessing scale? Let’s take the fairly regular computing situation of
programming on a dual-core laptop. This is not a whole lot of parallelism, but it is
better than nothing. We would expect that a pool with two processes would run twice
as fast as a pool with one process. At one process per physical processor, we would
expect peak performance. For greater than two processes we would expect some per‐
formance degradation, as the operating system spends more time context switching.
Example results can be seen in Figure 12-6. These were generated with the following
code:
import time
Ps = [1, 2, 4, 8]
runtimes = []
for P in Ps:
start = time.time()
simulate(P, 256, 3, 300, 1.0, 1e-3)
stop = time.time()
runtimes.append(stop - start)
As you can see in Figure 12-6, it is not precisely 2x faster, but is in the neighborhood
of 1.8x. The extra 0.2x that is missing goes to overhead in Python, process forking,
and the N-body algorithm. As expected, for pools of size 4 and 8 the speedup is worse
than in the two processes case, though it is still a considerable improvement over the
one processor case. Finally, it is worth noting that the one process case is 1.3x slower
than the equivalent case with no parallelism. This is because of the overhead in set‐
ting up the parallel infrastructure. However, this initial 30% burden is quickly over‐
come in the two processor case. All of these results together indicate that
multiprocessing is certainly worth it, even if you overestimate the number of process‐
ors that you have by a bit.
Figure 12-6. Speedup in multiprocessing N-body simulation on two processors
Multiprocessing is a great tool to use when you have a single machine and fewer than a
thousand physical processors. It is great for daily use on a laptop or on a cluster at
work. However, multiprocessing-based strategies do not scale up to supercomputers.
For that, we will need MPI, covered in the next section.
MPI
The gold standard for high-performance parallelism is MPI, the Message-Passing
Interface. As an interface MPI is a specification for how to communicate information
between various processes, which may be close to or very far from one another. The
MPI-3.0 Standard was released in September 2012. There are two primary open
source projects that implement MPI: MPICH and Open MPI. Since they implement
the same standard, these are largely interchangeable. They both take great care to
provide the MPI interface completely and correctly.
It is not an understatement to say that supercomputing is built on top of MPI. This is
because MPI is an abstraction for parallelism that is independent of the machine.
This allows physicists (and other domain experts) to learn and write MPI code and
have it work on any computer. Meanwhile, the architects of the supercomputers can
implement a version of MPI that is optimized to the machines that they are building.
The architects do not have to worry about who is going to use their version of MPI,
or how they will use it.
MPI is a useful abstraction for anyone who buys into its model. It is a successful
abstraction because almost everyone at this point does buy into it. MPI is huge and
very flexible, and we do not have space here to do it justice. It currently scales up to
the level of hundreds of thousands to millions of processors. It also works just fine on
a handful of processors. If you are serious about parallelism on even medium scales
(1,000+ processors), MPI is an unavoidable boon.
As its name states, MPI is all about communication. Mostly this applies to data, but it
is also true for algorithms. The basic elements of MPI all deal with how to communi‐
cate between processes. For a user, this is primarily what is of interest. As with all
good things, there is a Python interface. In fact, there are many of them. The most
commonly used one is called mpi4py. We will be discussing the mpi4py package rather
than the officially supported C, C++, or Fortran interfaces.
In MPI terminology, processes are called ranks and are given integer identifiers start‐
ing at zero. As with the other forms of parallelism we have seen, you may have more
ranks than there are physical processors. MPI will do its best to spread the ranks out
evenly over the available resources. Often—though not always—rank 0 is considered
to be a special “master” process that commands and controls the other ranks.
Having a master rank is a great strategy to use, until it isn’t! The
point at which this approach breaks down is when the master pro‐
cess is overloaded by the sheer number of ranks. The master itself
then becomes the bottleneck for doling out work. Reimagining an
algorithm to not have a master process can be tricky.
At the core of MPI are communicator objects. These provide metadata about how
many processes there are with the Get_size() method, which rank you are on with
Get_rank(), and how the ranks are grouped together. Communicators also provide
tools for sending messages from one processor and receiving them on other processes
via the send() and recv() methods. The mpi4py package has two primary ways of
communicating data. The slower but more general way is that you can send arbitrary
Python objects. This requires that the objects are fully picklable. Pickling is Python’s
native storage mechanism. Even though pickles are written in plain text, they are not
human readable by any means. To learn more about how pickling works and what it
looks like, please refer to the pickling section of the Python documentation.
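As a quick illustration of what "picklable" means, the following minimal sketch round-trips a small dictionary through the standard library pickle module, which is what mpi4py relies on for generic objects:

    import pickle

    payload = {'rank': 3, 'x': [1.0, 2.5], 'label': 'body'}
    blob = pickle.dumps(payload)          # bytes suitable for sending between processes
    print(pickle.loads(blob) == payload)  # True: the object round-trips faithfully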
NumPy arrays can also be used to communicate with mpi4py. In situations where
your data is already in NumPy arrays, it is most appropriate to let mpi4py use these
arrays. However, the communication is then subject to the same constraints as nor‐
mal NumPy arrays. Instead of going into the details of how to use NumPy and mpi4py
together, here we will only use the generic communication mechanisms. This is
because they are easier to use, and moving to NumPy-based communication does not
add anything to your understanding of parallelism.
The mpi4py package comes with a couple of common communicators already instan‐
tiated. The one that is typically used is called COMM_WORLD. This represents all of the
processes that MPI was started with and enables basic point-to-point communication.
Point-to-point communication allows any process to communicate directly with any
other process. Here we will be using it to have the rank 0 process communicate back
and forth with the other ranks.
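A minimal point-to-point sketch looks like the following; the filename hello-mpi.py is hypothetical, and it should be launched with exactly two ranks using the mpiexec utility described next:

    from mpi4py import MPI
    from mpi4py.MPI import COMM_WORLD

    # run with exactly two ranks: mpiexec -n 2 python hello-mpi.py
    rank = COMM_WORLD.Get_rank()
    if rank == 0:
        COMM_WORLD.send({'greeting': 'hello'}, dest=1, tag=0)  # any picklable object
    else:
        msg = COMM_WORLD.recv(source=0, tag=0)
        print("rank", rank, "received", msg)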
As with multiprocessing, the main module must be importable. This is because MPI
must be able to launch its own processes. Typically this is done through the
command-line utility mpiexec. This takes a -n switch and a number of nodes to run
on. For simplicity, we assume one process per node. The program to run—Python,
here—is then followed by any arguments it takes. Suppose that we have written our
N-body simulation in a file called n-body-mpi.py. If we wish to run on four processes,
we would start MPI with the following command on the command line:
$ mpiexec -n 4 python n-body-mpi.py
Now we just need to write the n-body-mpi.py file! Implementing an MPI-based solver
for the N-body problem is not radically different from the solutions that we have
already seen. The remove_i(), initial_cond(), a(), timestep(), and timestep_i()
functions are all the same as they were in “Multiprocessing” on page 296.
What changes for MPI is the simulate() function. To be consistent with the other
examples in this chapter (and because it is a good idea), we will also implement an
MPI-aware process pool. Let’s begin by importing MPI and the following helpers:
from mpi4py import MPI
from mpi4py.MPI import COMM_WORLD
from types import FunctionType
The MPI module is the primary module in mpi4py. Within this module lives the
COMM_WORLD communicator that we will use, so it is convenient to import it directly.
Finally, types is a Python standard library module that provides base classes for
built-in Python types. The FunctionType will be useful in the MPI-aware Pool that is
implemented here:
class Pool(object):
    """Process pool using MPI."""
    def __init__(self):
        self.f = None
        self.P = COMM_WORLD.Get_size()
        self.rank = COMM_WORLD.Get_rank()

    def wait(self):
        if self.rank == 0:
            raise RuntimeError("Proc 0 cannot wait!")
        status = MPI.Status()
        while True:
            task = COMM_WORLD.recv(source=0, tag=MPI.ANY_TAG, status=status)
            if not task:
                break
            if isinstance(task, FunctionType):
                self.f = task
                continue
            result = self.f(task)
            COMM_WORLD.isend(result, dest=0, tag=status.tag)

    def map(self, f, tasks):
        N = len(tasks)
        P = self.P
        Pless1 = P - 1
        if self.rank != 0:
            self.wait()
            return
        if f is not self.f:
            self.f = f
            requests = []
            for p in range(1, self.P):
                r = COMM_WORLD.isend(f, dest=p)
                requests.append(r)
            MPI.Request.waitall(requests)
        requests = []
        for i, task in enumerate(tasks):
            r = COMM_WORLD.isend(task, dest=(i%Pless1)+1, tag=i)
            requests.append(r)
        MPI.Request.waitall(requests)
        results = []
        for i in range(N):
            result = COMM_WORLD.recv(source=(i%Pless1)+1, tag=i)
            results.append(result)
        return results

    def __del__(self):
        if self.rank == 0:
            for p in range(1, self.P):
                COMM_WORLD.isend(False, dest=p)
Walking through the pieces of this class in order:
• self.f holds a reference to the function to execute; the pool starts off with no function. self.P is the total number of processors, and self.rank is which processor we are on.
• wait() is a method for receiving data when the pool has no tasks. Normally, a task is data to give as arguments to the function f(). However, if the task is itself a function, it replaces the current f(). The master process cannot wait. Each worker receives a new task from the master process: if the task was a function, it is put onto the object and the worker continues to wait; if the task was not a function, then it must be a real task, so the worker calls the function on this task and sends back the result.
• map() is a method to be used like before. It makes the workers wait while the master sends out tasks, sends all of the workers the function, evenly distributes tasks to all of the workers, and waits for the results to come back from the workers.
• __del__() shuts down all of the workers when the pool is shut down.
The purpose of the Pool class is to provide a map() method that is similar to the
map() on the multiprocessing pool. This class implements the rank-0-as-master strat‐
egy. The map() method can be used in the same way as for other pools. However,
other parts of the MPI pool operate somewhat differently. To start with, there is no
need to tell the pool its size. P is set on the command line and then discovered with
COMM_WORLD.Get_size() automatically in the pool’s constructor.
Furthermore, there will be an instance of Pool on each processor because MPI runs
the same executable (python) and script (n-body-mpi.py) everywhere. This implies
that each pool should be aware of its own rank so that it can determine if it is the
master or just another worker. The Pool class has to jointly fulfill both the worker
and the master roles.
The wait() method here has the same meaning as Thread.run() from “Threads” on
page 290. It does work when there is work to do and sits idle otherwise. There are
three paths that wait() can take, depending on the kind of task it receives:
1. If a function was received, it assigns this function to the attribute f for later use.
2. If an actual task was received, it calls the f attribute with the task as an argument.
3. If the task is False, then it stops waiting.
The master process is not allowed to wait and therefore not allowed to do real work.
We can take this into account by telling MPI to use P+1 nodes. This is similar to what
we saw with threads. However, with MPI we have to handle the master process
explicitly. With Python threading, Python handles the main thread, and thus the mas‐
ter process, for us.
The map() method again takes a function and a list of tasks. The tasks are evenly dis‐
tributed over the workers. The map() method is only runnable on the master, while
workers are told to wait. If the function that is passed in is different than the current
value of the f attribute, then the function itself is sent to all of the workers. Sending
happens via the “initiate send” (COMM_WORLD.isend()) call. We ensure that the func‐
tion has made it to all of the workers via the call to MPI.Request.waitall(). This
acts as an acknowledgment between the sender and all of the receivers. Next, the tasks
are distributed to their appropriate ranks. Finally, the results are received from the
workers.
When the master pool instance is deleted, it will automatically instruct the workers to
stop waiting. This allows the workers to be cleaned up correctly as well. Since the
Pool API here is different enough, a new version of the top-level simulate() func‐
tion must also be written. Only the master process should be allowed to aggregate
results together. The new version of simulate() is shown here:
def simulate(N, D, S, G, dt):
    x0, v0, m = initial_cond(N, D)
    pool = Pool()
    if COMM_WORLD.Get_rank() == 0:
        for s in range(S):
            x1, v1 = timestep(x0, v0, G, m, dt, pool)
            x0, v0 = x1, v1
    else:
        pool.wait()
Lastly, if we want to run a certain case, we need to add a main execution to the bot‐
tom of n-body-mpi.py. For 128 bodies in 3 dimensions over 300 time steps, we would
call simulate() as follows:
if __name__ == '__main__':
    simulate(128, 3, 300, 1.0, 1e-3)
Given MPI’s fine-grained control over communication, how does the N-body prob‐
lem scale? With twice as many processors, we again expect a 2x speedup. If the num‐
ber of MPI nodes exceeds the number of processors, however, we would expect a
slowdown due to managing the excess overhead. Figure 12-7 shows a sample study
on a dual-core laptop.
Figure 12-7. Speedup in MPI N-body simulation
While there is a speedup for the P=2 case, it is only about 1.4x, rather than the hoped-for 2x. The downward trend for P>2 is still present, and even steeper than with multi‐
processing. Furthermore, the P=1 MPI case is about 5.5x slower than the same
simulation with no parallelism. So, for small simulations MPI’s overhead may not be
worth it.
Still, the situation presented here is a worst-case scenario for MPI: arbitrary Python
code with two-way communication on a small machine of unspecified topology. If we
had tried to optimize our algorithm at all—by giving MPI more information or by
using NumPy arrays to communicate—the speedups would have been much higher.
These results should thus be viewed from the vantage point that even in the worst
case, MPI is competitive. MPI truly shines in a supercomputing environment, where
everything that you have learned about message passing still applies.
Parallelism Wrap-up
Parallelism is a huge topic, and we have only scratched the surface of what can be
accomplished, what mechanisms for parallelism exist, and what libraries implement
the various strategies. Having read this chapter, you are well prepared to go forth and
learn more about how to implement parallel algorithms in the context of scientific
computing. This is very different from the more popular web-based parallelism that
permeates our modern lives. The following list presents some excellent parallel sys‐
tems that you may find interesting or helpful to explore, beyond the essentials cov‐
ered here:
OpenMP
Preprocessor-based, easy to use, low-level parallelism for C, C++, and Fortran
GNU Portable Threads
Cross-platform thread system for C, C++, and Fortran
IPython Parallel
The parallel architecture that IPython uses, based on ZeroMQ
Twisted
Event-driven parallelism for Python web applications
You should now be familiar with the following ideas:
• There are many ways to measure scale.
• Certain problems are embarrassingly easy to make parallel, while others are very
difficult.
• High-performance computing systems are built to handle non-embarrassingly
parallel problems.
• High-throughput computing systems are best used for embarrassingly parallel or
heterogeneous problems.
• Non-parallel algorithms are faster than parallel code used with one process.
• Stay away from Python threads when number crunching.
• Multiprocessing is great for problems involving up to around a thousand processes.
• Use MPI when you really need to scale up.
Now that you know how to write software in serial and parallel, it is time to talk
about how to get your software to other computers.
CHAPTER 13
Deploying Software
So far, we have mostly been concerned with how to write software that solves physics
problems. In this chapter, though, we will discuss how writing software that runs reli‐
ably—especially at scale—can prove almost as challenging.
Most software developers want users for their code. For scientific programs this is
doubly true, since science runs on a reputation system. Without users, code can
hardly be called reproducible. However, even in the event that you do not wish for
users, users are still extraordinarily helpful. A new user is a fresh set of eyes that can
point out the blind spots of a program better than anyone else. What is broken a new
user will break. What is unintuitive a new user will not understand. This process
starts at installation.
New users should realize that if they do not understand something
about a project after reading the documentation, then the moral
failing lies with the developers and not with themselves. Good
users in this situation will then kindly report their struggles back to
the developers.
That said, you are your first user. Writing code is a distinctly different process from
running the code. Even with tools to mitigate this distinction, such as testing, the dif‐
ference is never completely gone. Deploying code that you wrote for the first time is
an exciting experience. It probably will not work on the first try. As a developer, it is
easy to address the feedback that you yourself have as a user. You can iterate through
this develop-then-use cycle as many times as necessary until the code really does work
(for you, at least).
What confounds the deployment problem is that every system is different. Every lap‐
top, desktop, phone, operating system, and environment has its own history. This
happens naturally as people use their computers. To make matters worse, oftentimes
the system might describe itself incorrectly. This can make installing and running on
a new system painful. Deployment often feels like an approximate art rather than an
exact science. Trying from the start to ensure that your software will run in as many
places as reasonably possible is the only way to preserve your sanity in the long term.
With a lot of attention and care, brand new systems can be forced to be the same.
This is incredibly useful to developers: if the software works on one copy of the sys‐
tem, it will likely work on all other copies. Strong guarantees like this often come
from virtualization in one form or another.
Virtualization is the act of isolating the software that you wish to run from the system
that is running it. This comes in two forms. A virtual machine allows for an entire
operating system to be installed and booted within another operating system. A vir‐
tual environment creates a new, clean space to run code in that is safely separated
from other parts of the operating system and environment.
Large-scale deployments of virtualizations can be found on supercomputers and in
the cloud. Supercomputers and smaller clusters often have login machines that users
connect to in order to launch jobs, compile their code, and perform other tasks. How‐
ever, the nodes that execute the software typically all run the same well-defined oper‐
ating system and environment. In the cloud, different virtualizations might exist to
fill different needs of an application. In all cases, the virtual environment is well
defined.
What these large deployments share is that the users are not allowed to touch the
software or environment directly. Users will break things. Such systems are too intri‐
cate and expensive to risk by giving even experienced people direct access. Removing
users from the mix removes a sure source of error. Virtualization in large computing
settings rightly remains a tool for developers. However, developers can prepackage a
virtualization and give it to users as a common starting point.
Deployment is a process, a struggle. Users want your software to work for them with
whatever they bring to the table. Users running Windows 98 Second Edition do not
care if your code works perfectly on a prerelease version of Linux. As in any negotia‐
tion, you will end up meeting somewhere in the middle, but it will likely be much
closer to the user’s side.
Figuring out what works for you as a developer and your users at the same time is
one of the great challenges in software engineering. Its importance and difficulty can‐
not be overstated. Adversity breeds success, however, and the rewards for trying are
huge for software deployment. This chapter covers various modern tools that are
used to help in the deployment process.
Deploying the Software Itself
The first stage of deployment is often to figure out how to package the software. This
involves creating a file that is distributable to a wide audience. Once users have this
file, they can run the code after a few special commands to install its contents on their
systems.
The internal structure of the package, how the package is distributed, and how it is
installed by the user all vary by package manager. Package managers are special pieces
of software that are responsible for installing other software on a user’s computer.
Most operating systems now come with at least one package manager preinstalled.
Lacking a package manager, you can always fall back to giving users a download link
to the source code and have them follow instructions to install your code manually.
This tends to limit your users to other developers.
For our purposes here, package management falls into three broad distribution
categories:
1. Source-based
2. Binary
3. Virtualization
Source-based distributions are an automated extension of the “give people a link to
the code” idea. The package manager will download the source code and install it
onto the user’s machine. For dynamic languages this is fine. Installation is fast and
errors will crop up at runtime, if relevant. For compiled languages, source-based dis‐
tribution is somewhat out of vogue. This is because it requires users to have a com‐
piler available on their systems. This is a reasonable assumption on most Linux
systems but almost categorically false on Mac and Windows systems, where it is diffi‐
cult to get a compiler working in the first place.
For this reason, binary package management has proven more successful for both
dynamic and compiled languages. With this approach, the developers compile the
code into its binary form for every combination of architectures that they wish to
support. Architectures are at minimum specified by word size (32 bits, 64 bits) and
operating system (Linux, Mac OS X, Windows). The results of compilation are then
added to a ZIP file, given to the user, and unzipped by the user into the appropriate
location. For users, this is fast and vastly reduces the possibility of error. For develop‐
ers, the extra work of creating many combinations of packages can be a headache.
Lastly, virtualizations can be packaged up and sent to users. These are similar to
binary packages in the sense that the developer expends up-front effort to create a
version of the code that should just work for the user. Virtualizations, however, go the
extra step of also giving the user the environment that the software was created in.
The user then has to manage the virtualization host to be able to run the software.
While this approach is easy for the developer to create and maintain, it is often a little
more work for the users. It also sometimes takes away from the users’ sense of agency
since the code is no longer directly running on their machines.
Which strategy should be pursued depends a lot on the expectations of potential
users. The rest of this section, while by no means comprehensive, presents a few pop‐
ular options for distributing and deploying software to users. Keeping with the theme
of the book, we focus on deploying Python software.
pip
Python packaging has had a long and storied history. The culmination of this is a tool
called pip, which is the Python Packaging Authority’s (PyPA) recommended way to
install Python code. In the past, there have been a broad spectrum of tools for creat‐
ing and managing packages. Each of these tools had its own proclivities. Almost all of
them were based on or compatible with the distutils module that lives in the
Python standard library. For most scientific software, though, distutils is an insuf‐
ficient choice: it handles compiling code from other languages poorly, but just well
enough to lull scientists into a false sense of security. Sincere attempts at fixing distutils for the scientific use case have been made, but none to date have been success‐
ful. Still, for a purely Python code package, pip is a good solution that works well.
The endorsement by PyPA gives pip a weight that will carry it forward into the
future.
pip is a command-line, source-based package manager that finds and downloads its
packages from the Python Package Index, or PyPI.1
1 For those current with their Monty Python humor, PyPI is sometimes pronounced cheeseshop.
For users, pip is easy. Here is an excerpt from the pip help:
$ pip -h
Usage:
  pip <command> [options]
Commands:
  install        Install packages.
  uninstall      Uninstall packages.
  list           List installed packages.
  show           Show information about installed packages.
  search         Search PyPI for packages.
  help           Show help for commands.
General Options:
  -h, --help     Show help.
  -V, --version  Show version and exit.
  -q, --quiet    Give less output.
For example, to install numpy, all the user has to do is execute the following
command:
$ pip install numpy
This will install numpy into the system Python. (On Linux, this command may need to
be run via sudo.) Alternatively, to install into user space, simply append the --user
switch. Other pip commands follow similarly.
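For instance, a user-space install and the matching listing and uninstall commands look like the following (numpy is just an example package):

$ pip install --user numpy
$ pip list
$ pip uninstall numpy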
As a developer, it is your job to create a pip-installable package and upload it to PyPI.
Luckily, if you have a source code directory structure like that which was presented in
“Packages” on page 60, then there are helpers that make this easy. Unfortunately,
picking which helper to use can be its own hardship. Historically, the distutils
package in the standard library was used to manage package installation. From here,
the setuptools package evolved to address certain issues in distutils. From setuptools came the distribute package, which itself gave rise to distutils2. Attempts
at getting some version of these back into the standard library in some form have
failed. So, in the intervening years, different code packages have used whichever
option the developer felt was reasonable at the time. For more on this mess, please see
The Hitchhiker’s Guide to Packaging. pip, thankfully, simplifies our lives by recom‐
mending that we use setuptools. A strategy that is successful in almost all cases is to
use setuptools if it is available and fall back to distutils when it is not.
Before you romp off and start deploying packages willy-nilly, it is a
good idea to make sure that you are at least mostly following the
standard best practices of testing, documentation, and compliance
with a style guide. See Chapters 18, 19, and 20 for more details. Oh,
and of course, make sure that the code actually works! This is
harder than it sounds.
To use distutils or the other helpers to manage your Python package, create a file
called setup.py at the top level of your directory structure. This should live outside of
the modules that you want to install. Going back to the directory structure from
“Packages” on page 60, we would place the setup.py as follows:
setup.py
/compphys
|- __init__.py
|- constants.py
|- physics.py
/more
|- __init__.py
|- morephysics.py
|- evenmorephysics.py
|- yetmorephysics.py
/raw
|- data.txt
|- matrix.txt
|- orphan.py
The setup.py file is at an equal level with the source code directory.
The sole purpose of setup.py is to import and call the setup() function with appro‐
priate arguments. This setup() function acts as a main function. It provides a
command-line interface for installing the software locally and also for making and
uploading packages to PyPI. The setup() function takes a number of keyword argu‐
ments that describe how the package source is laid out on the filesystem. It also
describes how the package should be installed on the user’s filesystem. The following
is an example of a setup.py that corresponds to file structure just shown:
import sys
try:
    from setuptools import setup
    have_setuptools = True
except ImportError:
    from distutils.core import setup
    have_setuptools = False

setup_kwargs = {
    'name': 'compphys',
    'version': '0.1',
    'description': 'Effective Computation in Physics',
    'author': 'Anthony Scopatz and Kathryn D. Huff',
    'author_email': 'koolkatz@gmail.com',
    'url': 'http://www.oreilly.com/',
    'classifiers': [
        'License :: OSI Approved',
        'Intended Audience :: Developers',
        'Programming Language :: Python :: 3',
        ],
    'zip_safe': False,
    'packages': ['compphys', 'compphys.more'],
    'package_dir': {
        'compphys': 'compphys',
        'compphys.more': 'compphys/more',
        },
    'data_files': [('compphys/raw', ['*.txt'])],
    }

if __name__ == '__main__':
    setup(**setup_kwargs)
In order, this script uses setuptools if we can and falls back to distutils otherwise, creates the package metadata before we call setup(), and finally calls setup() like it is a main function.
While most of the keyword arguments here are self-explanatory, a full and complete
description of all of the available options can be found in the distutils documenta‐
tion. The two primary commands of the setup script are build and install. That
may be seen from the help:
$ python setup.py -h
Common commands: (see '--help-commands' for more)
  setup.py build      will build the package underneath 'build/'
  setup.py install    will install the package
Global options:
  --verbose (-v)  run verbosely (default)
  --quiet (-q)    run quietly (turns verbosity off)
  --dry-run (-n)  don't actually do anything
  --help (-h)     show detailed help message
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
   or: setup.py --help [cmd1 cmd2 ...]
   or: setup.py --help-commands
   or: setup.py cmd --help
The build command builds the software into a build/ directory. This command will
create the directory if it does not already exist. The install command installs the
contents of the build directory onto the system (the computer you are currently
logged into). You can use the --user flag here to install into a user’s home directory
instead of installing systemwide. If the build command has not been run, the
install command will run build automatically. The following example shows how
to install the package from source and install into your user space:
$ python setup.py install --user
For pure Python code, source-only packages are easily created with the setup script
via the sdist command. This builds the package and puts it into a ZIP file that pip
knows how to unzip and install. The easiest way to create a package is by running this
command:
$ python setup.py sdist
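The archive is written to a dist/ directory, and its exact name comes from the name and version metadata in setup.py (for the example above it would be something like compphys-0.1.tar.gz, or a .zip on Windows). As a sanity check, you can install that file directly with pip:

$ pip install dist/compphys-0.1.tar.gz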
At this point, you as a developer now have a ZIP file living on your computer. Thus,
sdist does not solve the problem of actually getting the software to users. PyPI is the
easiest service to use for Python source-based distribution. This is because setuptools and distutils already plug into PyPI, and PyPI is free. Before you can upload
a package, you have to register it with the PyPI server. This ensures that no two pack‐
ages have exactly the same package name. Registration is accomplished through the
aptly named register command:
$ python setup.py register
This requires that you have an existing account on PyPI and that you provide some
metadata in setup.py about the package in its current state, such as the version num‐
ber. After the package is registered, you may copy it to PyPI with the upload com‐
mand. This command must follow an sdist command, as seen here:
$ python setup.py sdist upload
And that is all there is to it. Your Python package is ready for users to download and
install with pip. For more information, see the sdist command documentation and
the PyPI documentation.
While pip is great for pure Python source code, it falls a little flat for multilanguage
and compiled code projects. Up next, we will see a package manager that is better
suited to the needs of scientific software.
Conda
Conda is a cross-platform binary package manager that aims to solve many of the
problems with deploying scientific software. It was developed relatively recently by
Continuum in response to the deficiencies in using pip and distutils for scientific
programs. Like all good things, Conda is free and open source. Conda (at least ini‐
tially) took inspiration from Enthought’s Canopy/EPD Python distribution and pack‐
age manager, and it has gained considerable popularity lately.
Conda has three properties that jointly distinguish it from most other package man‐
agers, and from pip in particular:
1. It is general enough to seamlessly handle multilanguage and non-Python code
projects.
2. It runs on any operating system, and especially Linux, Mac OS, and Windows.
3. It runs in the user’s home space by default and does not try to install into the
system.
Many package managers have one or two of these features, but all of them are
required to satisfactorily cover the full range of scientific computing use cases. In the
past, deploying packages required the developers to target as many Linux distribu‐
tions as they cared to (apt, portage, pacman, yum, etc.), create packages for Mac OS X
for homebrew and macports, and create a custom binary installer for Windows. After
all of this effort, the developer had to hope that users had administrative privileges on
the machines that they were trying to install the package on. Conda replaces all of
that with a single distribution and interface.
The fastest and most reliable way to get Conda is to use the Miniconda distribution.
This is a package that installs Conda and its dependencies (including Python) into a
directory of the user’s choosing. The default install location is ~/miniconda. From
here, Conda can be used to search for and install all other desired packages. This
includes updates to Conda itself.
Using Conda is very similar to using pip or other package managers. An abbreviated
version of the conda help is shown here:
$ conda -h
usage: conda [-h] [-V] command ...
conda is a tool for managing environments and packages.
positional arguments:
  command
    help       Displays a list of available conda commands and their help
               strings.
    list       List linked packages in a conda environment.
    search     Search for packages and display their information.
    create     Create a new conda environment from a list of specified
               packages.
    install    Install a list of packages into a specified conda
               environment.
    update     Update conda packages.
    remove     Remove a list of packages from a specified conda environment.
    clean      Remove unused packages and caches
    build      Build a package from a (conda) recipe. (ADVANCED)
A user could install numpy via conda with the following command:
$ conda install numpy
This grabs numpy from the first channel that has a numpy package that matches the
user’s platform. A channel is a URL that points to a channel file, which in turn con‐
tains metadata about what packages are available on the channel and where to find
them. A channel can be added to the local Conda configuration with the following
command:
$ conda config --add channels http://conda.binstar.org/foo
Conda comes with a default channel that contains a wide variety of core packages.
Additional developer-supplied channels may be found at Binstar. Binstar serves the
same role as PyPI in that it is a place for developers to upload custom packages and
users to download them through Conda.
Conda’s package-building infrastructure is more general than that of distutils.
Rather than requiring a setup script, like pip, Conda looks for a directory of a certain
structure, known as a recipe. The name of the recipe directory is the same as the pack‐
age name. The recipe may contain the following files:
• build.sh (a bash build script for building on Linux, Mac, and other POSIX
systems)
• bld.bat (a batch script for building on Windows)
• meta.yaml (the metadata for the package in YAML format)
• run_test.py (an optional file for running tests)
• Optional patches to the source code
• Optional other files that cannot be included in other ways
This system is more general because you can write anything that you want in the
build scripts, as long as the resultant software ends up in the correct place. This
allows developers to fully customize their packages. However, this freedom can some‐
times be daunting. In Conda, it is explicitly the developer’s responsibility to ensure
that the code builds on all systems where packages are desired. You might not know
anything about building code on a Mac platform, but if you want a Conda package
for this operating system you have to figure it out. Please consult the Conda build
documentation for more information on writing Conda recipes, and see the conda-recipes GitHub page for many working recipe examples.
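To make this concrete, here is a rough sketch of what a recipe directory for the compphys package might contain. The contents are hypothetical—your dependencies, source location, and license will differ—but the file layout and the $PYTHON variable that conda build provides to the build script are the important parts:

# compphys/meta.yaml
package:
  name: compphys
  version: "0.1"
source:
  git_url: https://github.com/<username>/compphys.git   # hypothetical source location
requirements:
  build:
    - python
    - setuptools
  run:
    - python
    - numpy
about:
  license: BSD

# compphys/build.sh
#!/bin/bash
$PYTHON setup.py install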
Once you have a Conda recipe, it is easy to build a package for the system that you
are currently on. Simply pass the path to the recipe to the build command. Say we
had a compphys recipe in the current directory. This could be built with the following
command:
$ conda build compphys
To upload the new package to Binstar, you need to have a Binstar account and the
binstar command-line utility. You can get an account from the website, and you can
obtain the binstar command-line utility through Conda itself, as follows:
$ conda install binstar
This allows you to sign into Binstar from the shell using the binstar login com‐
mand. When you are logged in, any builds that you perform will also prompt you to
upload the package to Binstar under your account name. Thus, to both build and
upload a package, you just need to run the following commands:
$ binstar login
$ conda build compphys
Conda is the preferred binary package manager. It solves many of the problems
caused by language-based or operating system–based package managers. However,
neither source nor binary packages give the user the same execution environment
that the developer had when the package was built. To distribute the environment
along with the code, we first turn to virtual machines.
Virtual Machines
A virtual machine, or VM, is a simulated computer that runs as a process on another
computer. The simulated computer is known as the guest and the computer that is
running the simulation is called the host. Everything about modern computer archi‐
tecture is replicated: the number of processors, memory, disk drives, external storage,
graphics processors, and more. This allows the guest VM to run any operating system
with nearly any specifications, completely independently of what the host happens to
be running as its operating system. As with other recursions, you can nest VMs inside
of VMs. You could run Ubuntu inside of Vista inside of Windows 7 inside of Red Hat
inside of Mac OS X, though your friends might question the sanity of such an
endeavor.
Calling a virtual machine a simulation is slightly incorrect. More correctly, VMs are
hypervisors. While it is true that the hardware interface is simulated, what happens
when code is executed on the guest is a little more involved.
Normally when you run software on a computer, the operating system’s kernel sched‐
ules time on a processor for your process to execute. The kernel is the Grand Poobah
of the operating system. It is the algorithm that decides what gets run, when it gets
run, and with how much memory. For a virtual machine, the hypervisor translates
requests for time and space from the guest kernel into corresponding requests made
to the host kernel.
To make these requests as speedy as possible, the hypervisor often has hooks into the
host’s operating system. The guest gets real compute time on the host’s processor, but
in such a way that it is completely hidden from other parts of the host. Thus, a VM
does not simulate the way a processor works; that would be horridly expensive.
Rather, it simulates the hardware that the guest believes is the computer.
Setting up a virtual machine requires you to specify all of the attributes of the
machine that you would like to create: the number of processors, the size of the hard
disk, the amount of memory, and so on. However, this process has been greatly sim‐
plified in recent years. It now takes just a handful of button clicks to get a new VM up
and running.
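If you would rather script this than click through the GUI, VirtualBox ships with a command-line tool called VBoxManage that performs the same setup. The machine name, OS type, and sizes below are placeholders:

$ VBoxManage createvm --name "ubuntu-guest" --ostype Ubuntu_64 --register
$ VBoxManage modifyvm "ubuntu-guest" --cpus 2 --memory 2048
$ VBoxManage createhd --filename ubuntu-guest.vdi --size 20000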
The effort of moving to a VM buys you reliability and reproducibility. For example,
before you touch it at all, a new VM with the latest version of Ubuntu is going to be
exactly the same as all other VMs with the latest version of Ubuntu (that have the
same virtual hardware specs). Furthermore, you can snapshot a virtual machine to a
file, store it, and ship it to your friends and colleagues. This allows them to restart the
VM on another machine exactly where you left it. These features are incredibly valua‐
ble for reproducing your work, tracking down bugs, or opening Microsoft Office
while on a Linux machine.
The virtualization software that we recommend is Oracle’s VirtualBox. It is the most
popular open source VM software for day-to-day users. VirtualBox may run as a host
on Linux, Mac OS X, Windows, and a number of other operating systems. Competi‐
tors include VMware, XenServer, and KVM. Figure 13-1 shows an example virtual
machine with a 64-bit Ubuntu host running a 32-bit version of Windows 7 as a guest.
Figure 13-1. VirtualBox with a 64-bit Ubuntu 14.04 host and a 32-bit Windows 7 guest
Virtual machines are incredibly important for large-scale deployment. We will see
their primary strength when we get to “Deploying to the Cloud” on page 325. Yet
even when you’re working in small groups, it is useful to have a working virtual
machine with your software. Upload this VM to the Internet, and users can download
it and try out your code without actually trying to learn how to install it! The only
major disadvantage with this VM-based distribution strategy is that the virtual
machines can become quite large. The snapshot of a VM can easily range from 1 to
10 GB, depending on the guest operating system and the size of the code base.
The size and startup times of VMs can be crippling for small, automated tasks. This
has led to another valuable approach to packaging and deployment.
Docker
Motivated by the fact that virtual machines and hypervisors can be large and take a
long time to start up, a new technology called containers has recently rocketed to the
top of many people’s lists of deployment strategies. A container may be thought of as
operating system–level virtualization. Rather than the guest having to go through a
separate hypervisor, the operating system itself provides an interface for the guest to
request and access resources, safely shielded from other parts of the operating system.
This makes containers much lighter-weight and faster than traditional virtual
machines. It also adds restrictions, since the guest has to know about the host’s kernel
in order to access it.
Containers are typified by the Docker project. Initially released in March 2013,
Docker saw a stable v1.0 in June 2014. The velocity of its rise and adoption is
remarkable.
So why now? And why so fast? Without diving into too many of the details, Linux
Containers (LXC) have been around since Linux v2.6.24, which was released in Janu‐
ary 2008. However, they had some pretty large security holes. It was not until Linux
v3.8, released in February 2013, that these holes were fixed sufficiently to be viable for
large-scale deployment. The Docker project was started, and the rest is history.
The main limitation of containers is that the guest operating system must be the same
as the host operating system. Furthermore, LXC is a Linux-only technology right
now. Microsoft and Docker have recently announced a collaboration, so Windows
containers are on their way, but the Mac OS X platform has yet to start to catch up.
Critics of LXC will sometimes point out that other operating systems, such as
FreeBSD and Solaris, had container support long before Linux. For various historical
reasons, though, none of these container technologies gained the popularity that LXC
currently enjoys.
Since Docker is currently limited to Linux (and soon, hopefully,
Windows), feel free to skip the rest of this section if that is not your
platform of choice. What follows is a tutorial on how to use
Docker. You have now learned what you need to about containers
and their importance as a tool for collaboration; you can come
back to this section when you have a personal need for Docker
itself.
Using Docker is nearly synonymous with using Docker Hub, the online hosting ser‐
vice for Docker. Tight integration between the Docker command-line interface and
Docker the Internet service is part of what makes it so popular. If you do not already
have an account on Docker Hub, you should go create one now. It is easy and free.
Assuming you have Docker installed and usable (on Ubuntu, run the command sudo
apt-get install -y docker docker.io), you can run a simple “hello world” con‐
tainer with the following command:
$ sudo docker run ubuntu:14.04 echo "Hello, World!"
This executes docker with the run command, downloading from Docker Hub an
Ubuntu 14.04 image, as specified by the ubuntu:14.04. The remaining arguments are
any bash commands that you wish to run from inside the container. Here, we simply
run echo. If you have not run docker before, the output of this command will look
like the following:
$ sudo docker run ubuntu:14.04 echo "Hello, World!"
[sudo] password for scopatz:
Unable to find image 'ubuntu:14.04' locally
Pulling repository ubuntu
c4ff7513909d: Download complete
511136ea3c5a: Download complete
1c9383292a8f: Download complete
9942dd43ff21: Download complete
d92c3c92fa73: Download complete
0ea0d582fd90: Download complete
cc58e55aa5a5: Download complete
Hello, World!
A few notes on this output: you need root privileges to run docker (the password itself is a secret!); Docker intelligently stashes downloaded images for future use; and the final line is the output of our echo command.
This shows that the Ubuntu image, which was only around 225 MB, could not be
found locally, so Docker automatically downloaded it from Docker Hub for us.
Docker then executed the echo command. Compared to downloading and setting up
a whole virtual machine, using Docker is easy. (Of course, this is a “hello world”
example, so it should be easy!) Naturally, there are other tweaks you can make to this
process, such as specifying private resources other than Docker Hub for finding con‐
tainers. Note that the image that was downloaded was cached for later use. Rerunning
the same command will not require downloading the image again. The second time
around, we should only see the output of echo:
$ sudo docker run ubuntu:14.04 echo "Hello, World!"
Hello, World!
A list of all Docker images that are on the local system can be printed out with the
images command, as follows:
$ sudo docker images
REPOSITORY    TAG      IMAGE ID        CREATED        VIRTUAL SIZE
ubuntu        14.04    c4ff7513909d    2 weeks ago    225.4 MB
To avoid the business of downloading images when you want to run them, the pull
command allows you to download them ahead of time. Say we wanted to run the lat‐
est version of the WordPress blog. We could grab the corresponding image by passing
in wordpress:latest:
$ sudo docker pull wordpress:latest
Of course, you have to check the Docker Hub website to see what repositories
(ubuntu, wordpress, etc.) and what tags (14.04, latest, etc.) are available before you
can pull down an image.
You may also delete local Docker images from your system with the “remove image,”
or rmi, command. Suppose that we decided that we were not that into writing blogs
anymore and wanted to get rid of WordPress. This could be performed with the com‐
mand:
$ sudo docker rmi wordpress
Now, say that we wanted to add numpy to the ubuntu container so that it would be
readily available for future use. This kind of container customization is exactly what
Docker was built for, and it does it quite well. The first step is to launch the ubuntu
container in interactive mode. We do so by using the run command along with the -t
option to give us a terminal and the -i option to make it interactive. We will probably
want to run bash so that we can be effective once inside of the container, too. When
we run the following command we are dropped into a new interactive terminal
within the container:
$ sudo docker run -t -i ubuntu:14.04 /bin/bash
root@ae37c22b3c49:/#
While inside the container’s shell, Docker will automatically record anything that we
do. Let’s install Ubuntu’s package manager, install numpy, and then leave. These steps
are shown here:
$ sudo docker run -t -i ubuntu:14.04 /bin/bash
root@ae37c22b3c49:/# apt-get update
...
root@ae37c22b3c49:/# apt-get install -y python-numpy
...
root@ae37c22b3c49:/# exit
Note that while we are inside of the container, we have root privileges.
Back on our host machine, to truly save our work we have to commit the changes.
This creates a new image based on the original one, any modifications we may have
made, and metadata about the change that we supply. The docker commit command
takes the identifier that we saw in the container (here, ae37c22b3c49), a message
string via the -m option, an author name via the -a option, and a repository name for
the new image (here, ubuntu-numpy). When following along at home, be sure to sub‐
stitute your own Docker Hub username for <username>. Putting this all together, we can
commit our changes with the command:
$ sudo docker commit -m "with numpy" -a "<username>" ae37c22b3c49 <username>/ubuntu-numpy
73188d24344022203bee5ef5d6cb31ccaa8b5f38085ae69fcf9502828220f81d
Our new container now shows up in the images list and is available for future use.
Running the images command from before now produces the following output on
my computer:
$ sudo docker images
REPOSITORY            TAG     IMAGE ID      CREATED             VIRTUAL SIZE
scopatz/ubuntu-numpy  latest  73188d243440  About a minute ago  225.4 MB
ubuntu                14.04   c4ff7513909d  2 weeks ago         225.4 MB
Running docker with <username>/ubuntu-numpy will save us time, because numpy is
preloaded. We could also have built this same container using a Dockerfile. Dockerfiles
are more effort to set up, though also more reproducible. For most regular
Docker tasks, the interactive shell is good enough. Please see the Docker documenta‐
tion for more details.
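For reference, a Dockerfile that reproduces the interactive steps above could be as short as the following sketch, built with docker build instead of docker commit:

# Dockerfile
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y python-numpy

$ sudo docker build -t <username>/ubuntu-numpy .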
At this point, the ubuntu-numpy image still lives only on our computer. However, we
can upload it to Docker Hub for ourselves and others to freely use. This is done with
the push command. This command will ask you to log into Docker Hub if you have
not done so already. As you can see here, push requires that you specify the image
that you want to upload:
$ sudo docker push scopatz/ubuntu-numpy
The push refers to a repository [scopatz/ubuntu-numpy] (len: 1)
Sending image list
Please login prior to push:
Username: scopatz
Password:
Email: scopatz@gmail.com
Login Succeeded
The push refers to a repository [scopatz/ubuntu-numpy] (len: 1)
Sending image list
Pushing repository scopatz/ubuntu-numpy (1 tags)
511136ea3c5a: Image already pushed, skipping
1c9383292a8f: Image already pushed, skipping
9942dd43ff21: Image already pushed, skipping
d92c3c92fa73: Image already pushed, skipping
0ea0d582fd90: Image already pushed, skipping
cc58e55aa5a5: Image already pushed, skipping
c4ff7513909d: Image already pushed, skipping
73188d243440: Image successfully pushed
Pushing tag for rev [73188d243440] on {https://cdn-registry-1.docker.io/v1/
repositories/scopatz/ubuntu-numpy/tags/latest}
Docker is a sophisticated tool that puts power into its users’ hands. We have only
scratched the surface of what it can do here and not discussed how it works internally.
However, those details are not needed to use Docker to deploy physics code and you
can already see that it is an efficient and masterful way of creating, customizing, and
sharing software. Rightfully so, Docker is quickly replacing other, more traditional
methods of software deployment.
Now that you know how to deploy software through a variety of mechanisms, let’s go
on to where you might deploy it.
Deploying to the Cloud
Lately, it seems impossible to avoid hearing about the cloud: cloud systems, cloud
architectures, cloud business solutions, and the like. As a group of remote computers
that combine to provide a wide range of services to a local user or machine, the cloud
could easily be dismissed as just another phrase for the Internet itself. And to users,
there does not seem to be much distinction.
While there is no formal agreement on what the cloud is, a reasonable definition is
that it is the deployment of and interaction between three reliability strategies:
Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS), and Infrastructure-as-a-Service (IaaS). These technologies are enabled by virtual machines and containers,
which have already been covered in this chapter. You can envision cloud computing
as the stack shown in Figure 13-2.
Figure 13-2. Cloud computing stack
The cloud is not a revolution in terms of the kinds of technology that are deployed
(websites, email, databases, etc.). Rather, it is an evolution of who does the deploying,
at what scale, and with what level of redundancy. For example, if 10 years ago you
wanted an email server for your group, it often meant you would go over to the old
server in your closet, install the email server software, and then assign yourself and
your friends any email addresses you liked. This worked just fine until your hard
drive spun itself into oblivion, vermin ate through your ethernet cable, or your
Internet service provider suddenly decided to start blocking port 25.1 Alternatively,
suppose that you were running a website from your home and it suddenly became
extraordinarily popular.2 Your brand new DSL connection would probably not have
been able to handle the load. This would invariably lead to the site being down until it
was no longer popular. This used to happen so frequently that the phenomenon
earned the name the Slashdot effect, after the news website whose references caused
many pages to go down temporarily.
The cloud solves these problems by offloading services to a larger, more stable, and
better connected third party, such as Google or Amazon or Rackspace. It also allows
you to scale your services up or down as needed. Cloud service providers make it
very easy to provision new machines, bandwidth, or storage space as you need them.
They also make it easy to remove such resources once you are finished with them.
Starting at the bottom of the stack, Infrastructure-as-a-Service is where you rent
some or all of the physical parts of a large-scale computer system: hard drives,
servers, networking devices, electricity, an amazing Internet connection, and the roof
above all of this expensive machinery. You do not get an operating system, but you
are able to configure the kind of system you want temporarily. This is great to have if
you want to have a single machine to experiment with, 10 to do a trial run, 100 to do
a full release with, and then finally scale back to 50 when you realize you purchased
too many machines. IaaS maximizes flexibility and minimizes risk.
In the middle of the cloud stack live the Platform-as-a-Service vendors. In the PaaS
model, a developer will write an application and deploy it. The PaaS typically offers a
common way to set up and write applications, making it easy to do so once you have
adopted its platform. PaaS applications are often run on IaaS machinery. Examples of
PaaS include Google App Engine (GAE) and Heroku.
At the top level of the stack are user-facing Software-as-a-Service tools. These are
how most people interact with the cloud on a daily basis. Almost always, these tools
are websites that are either available publicly or to a limited set of people. The “soft‐
ware” here refers to the fact that a website is code-developed as a service-oriented
application, distinct from the hardware that runs the code. The classic example of
SaaS is Gmail, though in truth anything that involves an active user doing more than
looking at websites could be considered SaaS. Content generation sites such as Word‐
Press blogs represent this well.
1 Author’s note: all of these have happened to me.
2 This has not.
Cloud service providers (Google, Amazon, Rackspace) tend to supply their customers
with IaaS, PaaS, and SaaS in whatever mix they believe best suits customers’ needs.
The lines between these three categories are often blurred. A company that sets out to
sell software might also end up selling hard disk space because it finds that its users
want more storage. A business that just wants to sell time on processors for virtual
machines will typically also end up adding a web interface to manage those VMs.
The power of the cloud comes from the realization that you can be more efficient if
you can offload at least part of the minutiae of managing your services to someone
else who specializes at it. How you deploy to a particular platform depends on your
needs and the supplier. Every company has its own guides and documentation. Once
you think you know what you want, it can help to look for guides and to shop around
online before committing to a particular vendor.
In the physical sciences, the cloud is most often utilized when you or your group have
outgrown the resources that you currently have available. Rather than buying and set‐
ting up expensive new machinery, you can rent the resources you need, when you
need them, for as long as you need them. Since the price point is so much lower than
buying your own in-house capabilities, the cloud effectively brings large-scale com‐
puting services to everyone.
That said, the cloud is centered around a high-throughput computing model. The
next section covers deploying high-performance computing applications, which tend
to be more common in computational physics.
Deploying to Supercomputers
Supercomputers are notoriously finicky devices. They are large computer engineering
experiments. Unlike the cloud, a supercomputer was not built to serve you or your
needs. If you reach the point where you need to have or you do have access to a
supercomputer for your work, it will be clear that supercomputers are shared resour‐
ces. You will almost never need to access the whole machine. If you do, that just
means that you’ll have to wait your turn in line even longer.
Deployment in a supercomputing environment is embodied by the following three
features:
• Uniform processors and memory across nodes
• A parallel filesystem
• A scheduler
As was touched on in Chapter 12, having the same processor and the same amount of
memory available everywhere in a supercomputer means that the code that you write
for one part of the machine is executable on all other parts. This is great. However,
you do not have direct access to the supercomputer’s compute nodes. First, you must
sign in to the login node to access the machine at all. From the login node, you may
then submit jobs to be executed by the scheduler. Typically, supercomputing environ‐
ments require you to compile custom versions of your code in order for that code to
be able to run. Furthermore, every machine is its own special work of art. The cus‐
tomizations that you make for one supercomputer do not necessarily apply to
another machine. These customizations often break when going from one generation
of machines to the next.
Supercomputers, as a shared resource, also have a parallel filesystem. This is because
many people and many processes will attempt to access files simultaneously. As a
user, this will look and feel much like a normal filesystem that you have on your lap‐
top or desktop or flash drive. However, it will perform much differently. On a parallel
filesystem, every file you access incurs a certain overhead. For a large number of files
this overhead becomes unbearable, because each file has to be checked individually.
For example, on a laptop, executing the ls command in a directory with a hundred
thousand files might take 1–3 seconds. On a parallel filesystem, this same operation
could take half an hour or longer. Limiting the number of files that you have to access
is the key to being successful. File metadata is just that much slower on these systems.
There are usually tricks to make commands faster, but these are not necessarily well
known. For example, using ls --color=never will sometimes bring the listing run‐
time back down to sane levels, but not if you use the -l option or other flags that force a per-file metadata lookup.
Lastly, to actually run code on a supercomputer you need to go through the scheduler.
Common schedulers include TORQUE and Grid Engine. This program is responsi‐
ble for keeping track of how much total time each user has allocated and how much
has been used, and for determining which jobs get to be run and which ones remain
in the queue until the next job has finished. The point of the scheduler is to keep the
usage of the machine fair. As with most “fair” systems, it can leave its constituents
frustrated and annoyed. Almost all schedulers have a time limit. If a job exceeds the
time limit, the scheduler will abruptly kill the job. Time limits typically range from
three days to a week. Smaller and shorter jobs will typically move through the queue
more quickly than larger and longer jobs. However, since you are on a supercomputer
you likely need the greater resources. There is a balance that has to be struck.
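As an illustration only, a TORQUE submission script for the MPI N-body run from Chapter 12 might look like the following; the job name, node count, and walltime are placeholders that your site's documentation will dictate:

#!/bin/bash
#PBS -N n-body
#PBS -l nodes=4:ppn=1
#PBS -l walltime=04:00:00
cd $PBS_O_WORKDIR
mpiexec -n 4 python n-body-mpi.py

Saved under a name of your choosing, such as n-body.pbs, it would be handed to the scheduler with:

$ qsub n-body.pbs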
Since every supercomputer is special, you will need to consult the machine-specific
documentation for yours to deploy to it. If this does not work, please consult your
friendly and overworked system administrator to help figure out where you went
wrong. Note that kindness serves you well in these scenarios. When nothing is work‐
ing and everything has gone wrong, five minutes of social engineering can easily save
a month’s worth of computer engineering. Being nice helps.
Deployment Wrap-up
We have now seen a wide range of strategies for deploying software. You should be
familiar with these main points:
• Deployment is ultimately about your users.
• You are your first user.
• Source-based package managers are good for high-level languages, like Python.
• Binary package managers are good for compiled languages because the users do
not need to have compilers themselves.
• Virtual machines package up the entire environment, including the operating
system.
• Containers are much lighter-weight than virtual machines, but have the same
advantages.
• Cloud computing is useful when reliability is needed.
• Supercomputers can be frustrating to interact with because they are shared
resources, while simultaneously being computer engineering experiments. How‐
ever, the benefits outweigh the costs when your problem demands such a device.
Now that you know how to write and deploy software, it is time to learn the best
practices of computational science.
PART III
Getting It Right
CHAPTER 14
Building Pipelines and Software
The most effective analysis and software tools are reused and shared often. Physics
software, libraries, and analysis pipelines are no exception. Indeed, this book is inten‐
tionally titled Effective Computation in Physics. Effective computational physicists
streamline repetitive work and liberate themselves from the mundane. They are also
able to build upon the work of others by installing and using external libraries. This
chapter will cover both strategies, since they are intricately related by an emphasis on
automation and by a versatile tool called make.
By the end of this chapter, you should be able to get started with the following tasks:
• Automating complex workflows and analysis pipelines
• Configuring and installing external libraries
• Creating build systems for your own software
By automating the tedious steps, you are much more likely to encourage colleagues to
extend your work and to, therefore, positively impact the future of your field. This
chapter will help you to assist your peers in building, installing, and reproducing your
analysis pipeline by automating the various steps of building that pipeline and your
software:
1. Configuration
2. Compilation
3. Linking
4. Installation
The first step, configuration, detects platform-dependent variables and user-specified
options. It then uses those to customize later steps in the process. For example, an
analysis pipeline might run a Python program on a data file to create a particular plot.
However, the location of the data file may vary from analysis to analysis or from com‐
puter to computer. Therefore, the configuration step occurs before the program is
executed. The configuration step seeks out the proper data path by querying the envi‐
ronment and the user to configure the execution accordingly.
The next two steps, compilation and linking, are only necessary when you're building
software written in a compiled language. The compilation step relies on a compiler to
convert source code into a machine-readable binary format. The linking step attaches
that binary-formatted library or executable to other libraries on which it may depend.
These two steps prepare the software for installation.
In Chapter 13, the installation step was addressed in a Python context. This step is
when compiled libraries, executables, data, or other files are placed in an accessible
place in the filesystem. This is all followed by the execution step, when the user
actually runs the analysis pipeline or software.
This chapter will address each of these five steps in the context of the make utility and
its makefiles. These are ubiquitous in the computational physical sciences and are
integral to automating nearly any computational process.
make
make is a command-line utility that determines which elements of a process pipeline
need to be executed, and then executes them. It can be used to automate any process
with a tree of dependencies—that is, any process that builds files based on others.
Our case study in this chapter will be Maria Goeppert Mayer, who won the 1963
Nobel Prize in Physics for having theorized the nuclear shell model. She was a
remarkably effective scientist. Thus, we can be fairly certain that if she were still doing
science these days, she would definitely use make to automate her work. Any time a
new experiment generated new data or her theoretical model was tweaked, she might
want to update the figures in her draft theory paper.
make can automate this. Let us imagine that Professor Mayer is working on a paper
describing a new theory and that one day, she receives additional data to support her
theory. She would like to update her paper accordingly.
LaTeX and Its Dependencies
In the following examples, Prof. Mayer will be writing her papers in LaTeX. This pro‐
gram will be covered in great depth in Chapter 20, but some basic information about
it will be helpful for following along in this section. In particular, LaTeX is a program
that can be used to combine plain-text files and image files to create documents. Fun‐
damentally, an author creates some number of .tex files, image files, and others. The
LaTeX program converts those into a .dvi file, which is much like a .pdf.
To achieve this, first and foremost, she needs to add the new data to the list of data
files that she analyzes. Next, the data analysis program must be rerun using all of the
old data plus the new data. The results of the data analysis affect one of the figures in
her paper (“Figure 4: Photon-Photon Interactions”), so just adding the data and
rerunning the analysis won’t be enough; she also needs to rerun the plotting program
that generates the image for this figure. Of course, when any changes to the figures
or .tex files are made, the actual paper must be rebuilt using LaTeX (see Chapter 20).
This should sound familiar. It is the vast rabbit hole of tasks down which many physi‐
cists lose uncountably many research hours. These things that need to be done, the
tasks, are the nodes of a complex dependency tree. The file dependency tree for Prof.
Mayer’s project might resemble that in Figure 14-1. When a new data file is intro‐
duced, like 1948-12-21.h5, many of the files that depend on it must be regenerated.
A simple bash script like the ones discussed in Chapter 1 could be written to execute
every command on the tree, regenerating all of the figures and the paper any time it is
called. However, since the new data only affects one figure, not all of the figures need
be rebuilt. Such a bash script would spend a lot of time regenerating figures unneces‐
sarily: replotting the rest of the figures would be a waste of time, since they are
already up to date.
Figure 14-1. Mayer dependency tree
The make utility is superior. It can be used to automate every step of this situation
more efficiently, because it keeps track of how things depend on one another and
detects which pieces are not up to date. Given the file dependency tree and a descrip‐
tion of the processes that compile each file based on the others, make can execute the
necessary processes in the appropriate order.
Because it detects which files in the dependency tree have changed,
make executes only the necessary processes, and no more. This saves
time, especially when some actions take a long time to execute but
are not always necessary.
Can you tell what processes would need to be reexecuted if a new file in the raw_data
directory (1948-12-21.h5) were introduced? Try drawing a path up the branches of
the tree to the top. Which commands do you pass on the way?
Some platforms, like Windows and Mac OS X, do not have make
enabled in the Terminal by default. Try opening a terminal and typ‐
ing which make. If the make utility is available, this command will
output the path to the executable. If it is not available, which will
print nothing, and running make itself will fail with an error such as
"make: command not found." In that case, return
to the Preface for instructions on how to enable make on your
platform.
When a new data file is added, make can determine what analysis files, figures, and
documents are affected. It can then execute the processes to update them. In this way,
it can automatically rerun Prof. Mayer's data analysis, regenerate the appropriate plot,
and rebuild the paper accordingly.
It sounds glorious—too good to be true, really. Prof. Mayer would like to try running
make.
Running make
make can be run on the command line with the following syntax:
make [ -f makefile ] [ options ] ... [ targets ] ...
It looks like the make command can be run without any arguments. So, in some direc‐
tory where Prof. Mayer holds these interdependent files, she can try typing this magi‐
cal make command:
~/shell_model $ make
make: *** No targets specified and no makefile found.
Stop.
Uh oh, it looks like it may not be magic after all. But what is a makefile?
Makefiles
A makefile is just a plain-text file that obeys certain formatting rules. Its purpose is to
supply the make utility with a full definition of the dependency tree describing the
relationships between files and tasks. The makefile also describes the steps for updat‐
ing each file based on the others (i.e., which commands must be executed to update
one of the nodes).
If the make command is run with no arguments, then the make utility seeks a file
called Makefile in the current directory to fulfill this purpose. The error response
occurs because Prof. Mayer has not yet created a Makefile in the directory where she
holds her analysis.
If the makefile has any name other than Makefile, its name must be
provided explicitly. The -f flag indicates the location of that file to
the make utility. Makefiles with names other than Makefile are typi‐
cally only necessary if more than one makefile must exist in a sin‐
gle directory. By convention, makefiles not called Makefile end in
the .mk extension.
This section will discuss how to write makefiles by hand. Such makefiles can be used
to automate simple analysis and software pipelines. Prof. Mayer will create one to
update the plots in her paper based on new data.
Targets
First and foremost, the makefile defines targets. Targets are the nodes of the depend‐
ency tree. They are typically the files that are being updated. The makefile is made up
mostly of a series of target-prerequisite-action maps defined in the following syntax:
target : prerequisites
	action
A colon separates the target name from the list of prerequisites.
Note that the action must be preceded by a single tab character.
The analyzed .dat files depend on the raw .h5 files in the raw_data directory. They
also depend on the bash scripts that churn through the .h5 files to convert them into
useful .dat files. Therefore, the photon_photon.dat target depends on two prerequi‐
sites, the set of ./raw_data/*.h5 files and the photon_analysis.sh shell script.
Let us imagine the shell script is quite clever, having been written by Prof. Mayer her‐
self. It has been written to generically model various interactions and accepts argu‐
ments at runtime that modify its behavior. One of the arguments it accepts is the
number of photons involved in the interaction. Since the photon_photon.dat file
describes the two-photon interaction, the shell script is invoked with the special
flag -n=2, indicating the number of photons. The following definition in a makefile
sets up the target with its prerequisites and passes in this argument:
# Building the Shell Model Paper
photon_photon.dat : photon_analysis.sh ./raw_data/*.h5
	./photon_analysis.sh -n=2 > photon_photon.dat
The target file to be created or updated is the photon_photon.dat file. The prereq‐
uisites (the files on which it depends) are the shell script and the .h5 files.
This command is the action that must be taken to update the target using the
prerequisites.
In this example, the first line is a comment describing the file. That’s just good prac‐
tice and does not affect the behavior of the makefile. The second line describes the
target and the prerequisites, and the third line describes the action that must be taken
to update photon_photon.dat in the event that make detects any changes to either of
its prerequisites.
If this source code is saved in a file called Makefile, then it will be found when make is
executed.
Exercise: Create a Makefile
1. In the make directory of the files associated with this book,
create an empty file called Makefile.
2. Add the photon_photon.dat target as described.
3. Save the file.
Now that the makefile defines a target, it can be used to update that target. To build or
update a target file using make, you must call it with the name of the target defined in
the makefile. In this case, if make photon_photon.dat is called, then make will:
1. Check the timestamps of photon_photon.dat and of its prerequisites.
2. If any prerequisite is newer than the target (or the target does not yet exist), it will execute the action.
3. However, if the target is already newer than all of its prerequisites, nothing will happen, because everything
is up to date already (see the example below).
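For instance, a session in Prof. Mayer's directory might look something like this; the exact "up to date" wording varies a little between make versions:

~/shell_model $ make photon_photon.dat
./photon_analysis.sh -n=2 > photon_photon.dat
~/shell_model $ make photon_photon.dat
make: 'photon_photon.dat' is up to date.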
The makefile is built up of many such target-prerequisite-action maps. The full
dependency tree can accordingly be built from a set of these directives. The next node
Prof. Mayer might define, for example, is the one that rebuilds Figure 4 any time the
photon_photon.dat file is changed. That figure is generated by the plot_response.py
Python script, so any changes to that script should also trigger a rebuild of fig4.svg.
The makefile grows accordingly as each target definition is added. The new version
might look like this:
# Building the Shell Model Paper
photon_photon.dat : photon_analysis.sh ./raw_data/*.h5
	./photon_analysis.sh -n=2 > photon_photon.dat

fig4.svg : photon_photon.dat plot_response.py
	python plot_response.py --input=photon_photon.dat --output=fig4.svg
A new target, fig4.svg, is defined.
The fig4.svg file depends on photon_photon.dat as a prerequisite (as well as a
Python script, plot_response.py).
The action to build fig4.svg executes the Python script with specific options.
Since the figure relies on photon_photon.dat as a prerequisite, it also, in turn, relies on
prerequisites of photon_photon.dat. In this way, the dependency tree is made. So,
when make fig4.svg is called, make ensures that all the prerequisites of its prerequi‐
sites are up to date.
Exercise: Add Additional Targets
1. Open the Makefile created in the previous exercise.
2. Add the fig4.svg target as above.
3. Can you tell, from Figure 14-1, how to add other targets? Try
adding some.
4. Save the file.
The final paper depends on all of the figures and the .tex files. So, any time a figure or
the .tex files change, the LaTeX commands must be reissued. The LaTeX program will
be covered in much greater detail in Chapter 20. At that time, you may combine your
knowledge of make with your knowledge of LaTeX to determine what targets should
be included in a makefile for generating a LaTeX-based document.
Special Targets
The first target in a makefile is the default target; it is the one that is built
when make is called with no arguments. Often, the desired default behavior is to
update everything. An “all” target is a common convention for this. Note that the tar‐
get name does not have to be identical to the filename. It can be any word that is con‐
venient. The “all” target simply needs to depend on all other top-level targets.
In the case of Prof. Mayer’s paper, the all target might be defined using the wildcard
character (*):
# Building the Shell Model Paper
all : fig*.svg *.dat *.tex *.pdf

photon_photon.dat : photon_analysis.sh ./raw_data/*.h5
	./photon_analysis.sh -n=2 > photon_photon.dat

fig4.svg : photon_photon.dat plot_response.py
	python plot_response.py --input=photon_photon.dat --output=fig4.svg
...
Note how the all target does not define an action. It just collects prerequisites.
The all target tells make to do exactly what is needed. That is, when this target is
called (with make or make all), make ensures that all prerequisites are up to date, but
performs no final action.
Exercise: Create a Special Target
Another common special target is clean. This target is typically
used to delete generated files in order to trigger a fresh rebuild of
everything. (A sketch of one possibility follows this exercise.)
1. Open the Makefile you have been working with.
2. Create a “clean” target.
3. What are the appropriate prerequisites? Are there any?
4. What is the appropriate command to delete the auxiliary files
created by LaTeX?
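As a sketch of one possible answer to the exercise above (assuming the generated files include the .dat analysis outputs, the .svg figures, and LaTeX auxiliary files such as .aux, .log, and .dvi), a clean target might look like the following; remember that the action line must start with a tab:

.PHONY : clean
clean :
	rm -f *.dat fig*.svg *.aux *.log *.dvi

Because clean does not correspond to a real file, declaring it .PHONY tells make to run the action even if a file named clean happens to exist. It typically has no prerequisites.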
Now that she knows how to create a makefile, Prof. Mayer can use it to manage
dependencies for the entire process of building her paper from the raw data. This is a
common use for makefiles and facilitates many parts of analysis, visualization, and
publication. Another common use for makefiles is configuring, compiling, building,
linking, and installing software libraries. The next section will cover many aspects of
this kind of makefile.
Building and Installing Software
Python is called an interpreted language because it does not need to be compiled explicitly;
the interpreter handles that step behind the scenes. However, explicit compilation is not handled so nicely by all
programming languages. C, C++, Fortran, Java, and many others require multiple
stages of building before they are ready to run. We said in the introduction to this
chapter that these stages were:
1. Configuration
2. Compilation
3. Linking
4. Installation
From a user’s perspective, this maps onto the following set of commands for instal‐
ling software from source:
~ $ ./configure [options]
~ $ make [options]
~ $ make test
~ $ [sudo] make install
The configuration step may be called with a different command (e.g., ccmake or
scons). This step creates a makefile based on user options and system character‐
istics.
The build step compiles the source code into binary format and incorporates file
path links to the libraries on which it depends.
Before installing, it is wise to execute the test target, if available, to ensure that
the library has built successfully on your platform.
The installation step will copy the build files into an appropriate location on your
computer. Often, this may be a location specified by the user in the configuration
step. If the install directory requires super-user permissions, it may be necessary
to prepend this command with sudo, which changes your role during this action
to the super-user role.
For installation to succeed, each of these steps requires commands, flags, and custom‐
ization specific to the computer platform, the user, and the environment. That is, the
“action” defined by the makefile may involve commands that should be executed dif‐
ferently on different platforms or for different users.
For example, a compilation step can only use the compiler available on the computer.
Compilation is done with a command of the form:
compiler [options] [-l linked libraries]
For C++ programs, one user may use g++ while another uses clang and a third uses
gcc. The appropriate command will be different for each user. The makefile, therefore,
must be configured to detect which compiler exists on the machine and to adjust the
“action” accordingly. That is, in the action for the compilation step, the compiler com‐
mand and its arguments are not known a priori. Configuration, Compilation, Link‐
ing, and Installation depend on the computer environment, user preferences, and
many other factors.
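As a hedged sketch (the file names are invented, and g++ stands in for whichever compiler the configuration step detects), the compile and link actions for a small C++ component might look like:

g++ -c superphysics.cpp -o superphysics.o    # compilation: source to machine-readable object code
g++ superphysics.o -o superphysics -lm       # linking: attach the binary to the math library

On another machine the same step might invoke clang or a different set of flags, which is exactly why these commands are filled in during configuration rather than hardcoded.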
For this reason, when you are building and installing software libraries, makefiles can
become very complex. However, at their core, their operation is no different than for
simple analysis pipeline applications like the one in the previous section. As the
dependency tree grows, more targets are added, and the actions become more com‐
plex or system-dependent, more advanced makefile syntax and platform-specific con‐
figuration becomes necessary. Automation is the only solution that scales.
Configuration of the Makefile
It would be tedious and error-prone to write a custom makefile appropriate for each
conceivable platform-dependent combination of variables. To avoid this tedium, the
most effective researchers and software developers choose to utilize tools that auto‐
mate that configuration. These tools:
• Detect platform and architecture characteristics
• Detect environment variables
• Detect available commands, compilers, and libraries
• Accept user input
• Produce a customized makefile
In this way, configuration tools (a.k.a. “build systems”) address all aspects of the
project that may be variable in the build phase. Additionally, they enable the devel‐
oper to supply sensible default values for each parameter, which can be coupled with
methods to override those defaults when necessary.
Why Not Write Your Own Installation Makefile?
Writing your own makefile from scratch can be time-consuming and error-prone.
Furthermore, as a software project is adopted by a diversity of users and scales to
include dependencies on external libraries, generating an appropriate array of make‐
files for each use case becomes untenable. So, the makefile should be generated by a
sophisticated build system, which will enable it to be much more flexible across plat‐
forms than would otherwise be possible.
Some common build system automation tools in scientific computing include:
• CMake
• Autotools (Automake + Autoconf)
• SCons
Rather than demonstrating the syntax of each of these tools, the following sections
will touch on shared concepts among them and the configurations with which they
assist.
First among these, most build systems enable customization based on the computer
system platform and architecture.
Platform configuration
Users have various computer platforms with similarly various architectures. Most
software must be built differently on each. Even the very simplest things can vary
across platforms. For example, libraries have different filename extensions on each
platform (perhaps libSuperPhysics.dll on Windows, libSuperPhysics.so on Linux, and
libSuperPhysics.dylib on Mac OS X). Thus, to define the makefile targets, prerequisites, and
actions, the configuration system must detect the platform. The operating system
may be any of the following, and more:
• Linux
• Unix
• Windows
• Mobile
• Embedded
Additionally, different computer architectures store numbers differently. For exam‐
ple, on 32-bit machines, the processors store integers in 32-bit-sized memory blocks.
However, on a 64-bit machine, an integer is stored with higher precision (64 bits).
Differences like this require that the configuration system detect how the current
architecture stores numbers. These specifications often must be included in the com‐
pilation command.
Beyond the platform and architecture customizations that must be made, the system
environment, what libraries are installed, the locations of those libraries, and other
user options also affect the build.
System and user configuration
Most importantly, different computers are controlled by different users. Thus, build
systems must accommodate users who make different choices with regard to issues
such as:
• What compiler to use
• What versions of libraries to install
• Where to install those libraries
• What directories to include in their PATH and similar environment variables
• What optional parts of the project to build
• What compiler flags to use (debugging build, optimized build, etc.)
The aspects of various systems that cause the most trouble when you’re installing a
new library are the environment variables (such as PATH) and their relationship to the
locations of installed libraries. In particular, when this relationship is not precise and
accurate, the build system can struggle to find and link dependencies.
Dependency configuration
When one piece of software depends on the functionality of another piece of soft‐
ware, the second is called a dependency. For example, if the SuperPhysics library
relies on the EssentialPhysics library and the ExtraPhysics library, then they are
its dependencies. Before attempting to install the SuperPhysics library, you must
install EssentialPhysics and ExtraPhysics.
The build can fail in either of these cases:
• The build system cannot locate a dependency library.
• The available library is not the correct version.
The build system seeks the libraries listed in the PATH, LD_LIBRARY_PATH, and similar
environment variables. Thus, the most common problems in building software arise
when too many or not enough dependency libraries appear in the directories targeted
by the environment.
When too many versions of the ExtraPhysics library are found, for example, the
wrong version of the library might be linked and an error may occur. At the other
extreme, if no EssentialPhysics library is found, the build will certainly fail. To fix
these problems, be sure all dependencies are appropriately installed.
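For example, if the ExtraPhysics dependency were installed under a nonstandard prefix, the environment might be pointed at it with something like the following sketch (the install path is hypothetical):

export PATH=$HOME/opt/extraphysics/bin:$PATH
export LD_LIBRARY_PATH=$HOME/opt/extraphysics/lib:$LD_LIBRARY_PATH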
Once all dependencies, environment variables, user options, and other configurations
are complete, a makefile or installation script is generated by the build system. The
first action it conducts is the compilation step.
Compilation
Now that the makefile is configured, it can be used to compile the source code. The
commands in the makefile for a software build will be mostly compiler commands.
Without getting into too much detail, compilers are programs that turn source code
into a machine-readable binary format.
The build system, by convention, likely generated a makefile with a default target
designed to compile all of the source code into a local directory. So, with a simple
make command, the compiled files are generated and typically saved (by the makefile)
in a temporary directory as a test before actual installation. Additionally, once com‐
piled, the build can usually be tested with make test.
If the tests pass, the build system can also assist with the next step: installation.
Installation
As mentioned in Chapter 13, the key to attracting users to your project is making it
installable.
On Windows, this means creating a Setup.exe file. With Python, it means implement‐
ing a setup.py or other distribution utility. For other source code on Unix systems,
this means generating a makefile with an install target so that make install can be
called.
Why not just write a simple script to perform the installation?
The user may eventually want to upgrade or even uninstall your program, fantastic as
it may be. By tradition, the installation program is usually created by the application
developer, but the uninstall program is usually the responsibility of the operating sys‐
tem. On Windows, this is handled by the Add/Remove Programs tool. On Unix, this
is the responsibility of the package manager. This means the installation program
needs special platform-dependent capabilities, which are usually taken care of by the
build system.
For example, on Linux, make install is not used when creating packages. Instead,
make DESTDIR=<fake-root> install installs the package to a fake root direc‐
tory. Then, a package is created from the fake root directory, and uninstallation is
possible because a manifest is generated from the result.
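A packaging workflow might therefore stage the installation with a command along these lines (the staging path is hypothetical):

make DESTDIR=/tmp/superphysics-staging install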
The build system will have created this target automatically. If the installation loca‐
tion chosen in the configuration step is a restricted directory, then you must execute
the make install command with sudo in order to act as the superuser:
sudo make install
At that point, the software should be successfully installed.
Building Software and Pipelines Wrap-up
At the end of this chapter, you should now feel comfortable with automating pipe‐
lines and building software using makefiles. You should also now be familiar with the
steps involved in building a non-Python software library from source:
1. Configuration
2. Compilation
3. Linking
4. Installation
Additionally, with the knowledge of how these steps and the makefile relate to the
platform, environment, user, and dependencies, you should feel prepared to under‐
stand a wide array of installation issues more fully.
If this chapter has succeeded in its purpose, you may be interested in researching
automated build systems (e.g., CMake, autotools, and SCons) more fully in order to
implement a build system for your own software project.
CHAPTER 15
Local Version Control
In science, reproducibility is paramount. A fundamental principle of science, repro‐
ducibility is the requirement that experimental results from independent laboratories
should be commensurate. In scientific computation, simulations, data munging, and
analysis pipelines are the experimental analogs. To ensure that results are repeatable, it
must be possible to unwind code and analysis to previous versions, and to replicate
plots. The most essential requirement is that all previous versions of the code, data,
and provenance metadata must be robustly and retrievably archived. The best prac‐
tice in scientific computing is called version control.
Rather than inventing a system of indexed directories holding full versions of your
code from each day in the lab, the best practice in software development is to use a
version control system that automates archiving and retrieval of text documents such
as source code.
This chapter will explain:
• What version control is
• How to use it for managing files on your computer
• And how to use it for managing files in a collaboration
First up, this chapter will discuss what version control is and how it fits into the
reproducible workflow of an effective researcher in the physical sciences.
What Is Version Control?
Very briefly, version control is a way to:
• Back up changing files
• Store and access an annotated history
• And manage merging of changes between different change sets
There are many tools to automate version control. Wikipedia provides both a nice
vocabulary list and a fairly complete table of some popular version control systems
and their equivalent commands.
Fundamentally, version control provides capabilities similar to those that a laboratory
notebook historically has provided in the workflow of the experimental scientist. In
this way, it can be considered a sort of laboratory notebook for scientists who use
computation.
The Lab Notebook of Computational Physics
The Wu Experiment, one of the foundational experiments in nuclear physics, demon‐
strated a violation of the Law of Conservation of Parity. Dr. Chien Shiung Wu, a Chi‐
nese physicist at Columbia University, took great pains to construct a reproducible
experiment. Toward this goal, she even moved her experimental setup to the National
Bureau of Standards Headquarters in Maryland while her colleagues reproduced her
work back at her home institution. This work led to a Nobel Prize for her theorist
colleagues (and the inaugural Wolf Prize for Madame Wu).
A modern Dr. Wu (we will call her Dr. Nu) might simulate this physics experiment
with software. Dr. Nu is a persistent researcher, so she works on this important soft‐
ware project day and night until it accurately represents the theory of her colleagues.
When she finishes one day before dinnertime, she plots her results and is relieved to
be ready to submit them to a journal in the morning.
An unconscionable number of months later, she receives the journal’s review of her
work. It is a glowing review, but asks that the results be presented (in all plots, equa‐
tions, and analysis) with the inconvenient positive-current convention for charged
particle currents.
Since so many months have passed since the article was first submitted, the code has
changed significantly and plots are now rendered differently in preparation for a new
journal submission. For an alarming percentage of physicists, this would be a minor
disaster involving weeks of sorting through the files to recall the changes that have
happened in the last year (Merali 2010). It might even be impossible to roll back the
code to its previous state.
However, Dr. Nu breathes a sigh of relief. She has been using version control. With
version control, she can execute a single command to examine the record of her
actions over the last several months. Her version control system has kept, essentially,
a laboratory notebook of her software development history. When satisfied with her
understanding of the logs, she can execute another simple command to revert the
code to the state it was in the night she made the journal-ready plots. Before after‐
noon tea, Dr. Nu makes the simple change of sign convention in the plot, reruns her
plotting script, and submits the revisions. Once tea is over, she can bring the reposi‐
tory back up to date and get back to work, as if nothing had changed.
What would happen if you received a review asking for a convention change to
results you completed a year ago? If that’s a scenario that is simply too terrible to
imagine, you are not alone. This chapter will explain how Dr. Nu reached this envia‐
ble position, so that you can do the same.
We will start by explaining the types of version control available to a scientist. Then,
the rest of the chapter will explain version control concepts in the context of one ver‐
sion control system in particular.
Version Control Tool Types
Version control systems come in two fundamentally different categories. More
modern version control systems are “distributed” rather than “centralized.” Central‐
ized systems designate a single (central) definitive location for a repository’s source,
while distributed version control systems treat all (distributed) locations as equals.
Some common version control systems in each category include:
• Centralized
— Concurrent Versions System (cvs)
— Subversion (svn)
— Perforce (p4)
• Distributed
— Decentralized CVS (dcvs)
— mercurial (hg)
— bazaar (bzr)
— Git (git)
Recently, distributed version control systems have become more popular. They are
better suited to collaborative projects, since their capabilities for managing and merg‐
ing together changes from multiple developers are more powerful and user-friendly.
Choosing the appropriate option from among these should depend largely on the
expertise of your colleagues and their collaboration style. Due to its popularity, flexi‐
bility, and collaborative nature, we will demonstrate version control concepts using
the Git tool. Git, written by Linux creator Linus Torvalds, is an example of a dis‐
tributed version control system. It has a somewhat steep learning curve, so the sooner
we can get started, the better.
Getting Started with Git
Git needs to be installed and configured before it can be used to control the versions
of a set of files. When Dr. Nu was first getting started with her simulation, she knew
she should keep versions of everything, just like in a laboratory notebook. So, she
decided to install Git on the computer where she writes her code and conducts her
analysis.
Installing Git
The first step for using Git is installing it. Dr. Nu wasn’t sure if she already had Git
installed on her computer, so to check she used the which command we met in Chap‐
ter 1.
To determine whether Git is already installed and to find help
using it, try executing which git in the terminal. Does it respond
with a path to the program, or does it return nothing?
If which git returns no executable path, then Git is not installed. To install it, she’ll
need to:
1. Go to the Git website.
2. Follow the instructions for her platform.
On some platforms, the default version of Git is quite old (developers call that stable).
Unfortunately, it may not have all the features of an up-to-date version. So, even if Git
is installed, you may consider updating it anyway.
Once Git has been installed (or updated), it can be used. But how?
Getting Help (git --help)
The first thing Dr. Nu likes to know about any tool is how to get help. From the com‐
mand line, she types:
~ $ man git
The manual entry for the Git version control system appears before her, rendered in
less. She may scroll through it using the arrow keys, or she can search for keywords
by typing / followed by the search term. Dr. Nu is interested in help, so she types /
help and then hits Enter.
By doing this, Dr. Nu finds that the syntax for getting help with Git is git --help.
Manual (man) pages are rendered in less. To exit the man page,
therefore, type the letter q and hit Enter.
To try this help syntax, Dr. Nu exits the man page and tests what happens when she
types:
~ $ git --help
Excellent! Git returns, to the terminal, a list of commands it is able to help with, as
well as their descriptions:
usage: git [--version] [--exec-path[=<path>]] [--html-path]
           [-p|--paginate|--no-pager] [--no-replace-objects]
           [--bare] [--git-dir=<path>] [--work-tree=<path>]
           [-c name=value] [--help]
           <command> [<args>]
The most commonly used git commands are:
   add        Add file contents to the index
   bisect     Find by binary search the change that introduced a bug
   branch     List, create, or delete branches
   checkout   Checkout a branch or paths to the working tree
   clone      Clone a repository into a new directory
   commit     Record changes to the repository
   diff       Show changes between commits, commit and working tree, etc
   fetch      Download objects and refs from another repository
   grep       Print lines matching a pattern
   init       Create an empty git repository or reinitialize an existing one
   log        Show commit logs
   merge      Join two or more development histories together
   mv         Move or rename a file, a directory, or a symlink
   pull       Fetch from and merge with another repository or a local branch
   push       Update remote refs along with associated objects
   rebase     Forward-port local commits to the updated upstream head
   reset      Reset current HEAD to the specified state
   rm         Remove files from the working tree and from the index
   show       Show various types of objects
   status     Show the working tree status
   tag        Create, list, delete or verify a tag object signed with GPG

See 'git help <command>' for more information on a specific command.
The help command even has a metacomment about how to get more specific
information about a particular command. Based on this, how would you learn
more about the git init command?
That’s a lot of commands, but it isn’t everything Git can do. This chapter will cover
many of these and a few more. The first command we will show completes the setup process: git config.
Control the Behavior of Git (git config)
To complete the setup process for Git, a configuration step is necessary. Knowing
your name, email address, favorite text editor, and other data will help Git to behave
optimally and to correctly provide attribution for the work that you do with your
files.
Dr. Nu knows that version control provides an exceptional attribution service to sci‐
entists collaborating on code. When a change is made to code under version control,
it must be attributed to the author. To ensure that she is appropriately attributed for
her excellent work and held accountable for any code bugs, Dr. Nu configures her
instance of Git thus:
~ $ git config --global user.name "Nouveau Nu"
~ $ git config --global user.email "nu@university.edu"
~ $ git config --global core.editor "nano"
Changes made by Dr. Nu will be attributed to the name she provides to Git.
Her university email address will also be stored with each logged change.
Git can behave more optimally when it is aware of the author’s preferred text edi‐
tor. Later, we will see how.
Not only do scientists love to see their names in print next to a piece of excellent
work, but authorship metadata is essential to provide provenance and answer the
question, “Where did this work come from?” Indeed, attribution is central to the sci‐
entific process, since accountability is one of the fundamental motivators of scientific
excellence.
Exercise: Configure Git on Your Computer
1. Use the preceding example as a model to inform Git of your
name, email address, and favorite text editor.
2. List your new configuration settings with git config --list.
Now that Git is set up on her system, Dr. Nu is able to use it to manage the versions of
the files stored on her computer.
Local Version Control with Git
All researchers in scientific computing have at least one computer full of data, text
files, images, plots, scripts, and other software. Those files often constitute the bulk of
the day-to-day efforts of that researcher. Code and data are often created, manipu‐
lated, dealt with, and stored primarily on a single computer. Controlling versions of
those files involves:
• Creating a repository where those changes are stored
• Adding files to that repository so that they can be tracked
• Taking snapshots of incremental versions, so that they are logged
• Undoing changes
• Redoing them
• Trying new ideas in separate sandboxes
We will talk about all of these tasks in this section. So first, we must learn how to cre‐
ate a repository.
Creating a Local Repository (git init)
Dr. Nu would like to write code that simulates Dr. Chien-Shiung Wu’s landmark
experiment. Writing this code may take years and involve the effort of many graduate
students, and the code will undergo many iterations. To keep track of numerous ver‐
sions of her work without saving numerous copies, Dr. Nu can make a local repository
for it on her computer.
A repository is where the tracked files live and are edited. For each
version of those files, Git records the change set, or "diff"—the line-by-line differences between the new version and the one before it.
To begin keeping a record of files within a directory, Dr. Nu must enter that directory
and execute the command git init. This creates an empty repository:
~ $ mkdir parity_code
~ $ cd parity_code
~/parity_code $ git init
Initialized empty Git repository in /filespace/people/n/nu/parity_code/.git/
First, she creates the directory where she will do the work.
She navigates to that directory.
She initializes a repository within that directory.
Git responds positively. An empty repository has been created here.
Because she is a scientist, Dr. Nu is curious about what happened. She can browse the
directory’s hidden files to see what happened here:
~/parity_code $ ls
~/parity_code $ ls -A
.git
~/parity_code $ cd .git && ls -A
HEAD  config  description  hooks  info  objects  refs
A simple listing of the directory contents results in nothing. The directory
appears to be empty. Where is the repository?
Curious, Dr. Nu lists all of the contents of the repository.
A hidden directory, .git, is visible.
Navigating into that directory and listing all of its contents reveals the mecha‐
nism by which the repository operates.
With ordinary use of Git, none of those hidden files will ever need to be altered.
However, it’s important to note that the infrastructure for the repository is contained
within this hidden subdirectory (.git) at the top level of your repository.
A whole repository directory can be moved from one location to
another on a filesystem as long as the .git directory inside moves
along with it.
This means that moving the entire repository directory to another location is irrele‐
vant to the behavior of the repository. However, moving files or directories outside of
the repository will move them outside of the space governed by the hidden
infrastructure.
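For instance (the projects directory here is purely illustrative), the whole repository can be relocated without upsetting Git:

~ $ mv parity_code projects/parity_code
~ $ cd projects/parity_code
~/projects/parity_code $ git status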
Exercise: Create a Local Repository
1. From the Terminal, create a new directory like Dr. Nu’s and
use git init to make it an empty local repository.
2. Browse the files in the hidden directory and find out what you
can learn in one minute.
Now that a repository has been initialized, work can begin in this directory. As work
progresses, files must first be added to the directory.
Staging Files (git add)
Now, Dr. Nu has created a repository directory to start working in. So, she gets to
work creating a “readme” file. This is an important part of the documentation of any
software; it indicates basic information about the project, as well as where to find
more details.
First, she can create an empty file with the touch command:
~/parity_code $ touch readme.rst
Now the file exists in the directory. However, it is not yet being tracked by the reposi‐
tory. For the Git repository to know which files Dr. Nu would like to keep track of,
she must add them to the list of files that the repository knows to watch. This is called
“staging” the files. It is analogous to arranging people on a stage so that they are ready
for a photo to be taken. In this case, we are staging the file so that it can be included
in the upcoming snapshots.
Thus, to make Git aware of the file, she adds it to the repository with the git add
command:
~/parity_code $ git add readme.rst
Exercise: Add a File to a Local Repository
1. Create a readme file within your repository directory.
2. With git add, inform Git that you would like to keep track of
future changes in this file.
Now that something has been added to the repository, the state of the repository has
changed. Often, it is important to be able to check the state. For this, we will need the
git status command.
Checking the Status of Your Local Copy (git status)
The files you’ve created on your machine are your local “working” copy. The reposi‐
tory, as we have already said, stores versions of the files that it is made aware of. To
find out the status of those files, a status command is available:
~/parity_code $ git status
On branch master
Initial commit
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   readme.rst
Check the status of the repository in the current directory.
Git has something called “branches.” We are on the “master” branch by default.
We will talk more about branches later in the chapter.
Git knows that we have not yet used the commit command in this repository.
Git gives us a hint for what to do if we did not really intend to add the readme
file. We can “unstage” it with git rm!
Git reports that there is a new file “to be committed.” We will discuss committing
in the next section.
This result indicates the current difference between the repository records (which, so
far, are empty) and the parity_code directory contents. In this case, the difference is
the existence of this new readme.rst file. Git suggests that these changes are “to be
committed.” This means that now that the file has been added to the watch list, the
scene is set and the repository is ready for a snapshot to be taken. We save snapshots
with the git commit command.
Saving a Snapshot (git commit)
In order to save a snapshot of the current state of the repository, we use the commit
command. This command:
1. Saves the snapshot, officially called a “revision”
2. Gives that snapshot a unique ID number (a revision hash)
3. Names you as the author of the changes
4. Allows you, the author, to add a message
The git commit command conducts the first three of these tasks automatically. How‐
ever, the fourth requires input from the author. When executing the git commit
command, the author must provide a “commit message” describing the changes rep‐
resented by this commit and indicating their purpose. Informative commit messages
will serve you well someday, so make a habit of never committing changes without at
least a full sentence description.
Log messages have a lot of power when used well. To this end,
some open source projects even suggest information to include in
all commit messages. This helps the developers to more systemati‐
cally review the history of the repository. For example, the Pandas
project uses various three-letter keys to indicate the type of com‐
mit, such as (ENH)ancement, BUG, and (DOC)umentation.
In the same way that it is wise to often save a document that you are working on, so
too is it wise to save numerous revisions of your code. More frequent commits
increase the granularity of your “undo” button.
Commit often. Good commits are atomic, the smallest change that
remains meaningful. They should not represent more work than
you are willing to lose.
To commit her work, Dr. Nu simply types git commit into the command line. Git
responds by opening up an instance of the nano text editor, where she can add text to
a document recording the change. She does so by adding a message: “This is my first
commit. I have added a readme file.” When she saves the file, the commit is complete.
Exercise: Commit Your Changes
1. Use git commit to save the staged changes of the file you’ve
added to your repository.
2. Git will send you to your preferred text editor. There, create a
message, then save and exit.
3. Admire your work with the git status command. You should
see something like:
$ git status
# On branch master
nothing to commit (working directory clean)
If instead you receive a warning indicating that Git is not config‐
ured, you will need to return to “Control the Behavior of Git (git
config)” on page 354 and configure it.
Now, as she makes changes to the files in her repository, Dr. Nu can continue to com‐
mit snapshots as often as she likes. This will create a detailed history of her efforts.
Exercise: Stage and Commit New Changes
1. Edit your readme file. It should say something like:
   Welcome
   This is my readme file for a project on parity
   violations in the standard model.
2. Stage it for the snapshot (git add).
3. Commit the snapshot (git commit).
4. Add a meaningful commit message.
So far, we've learned that the workflow should be:
1. Make changes.
2. git add the files you want to stage for a commit.
3. git commit those files.
4. Fill out the log message.
5. Repeat.
Since that is a lot of steps, note that command-line flags can cut this way down. Some
useful flags for git commit include:
-m: add a commit message from the command line
-a: automatically stage tracked files that have been modified or deleted
-F: add a commit message from a file
--status: include the output of git status in the commit message
--amend: fix the commit message at the repository tip
These can be used in combination to reduce the add/commit/message process to one
command: git commit -am "<message>".
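For instance, a combined stage-and-commit might look something like the following; the hash in the output is made up for illustration:

~/parity_code $ git commit -am "Clarify in the readme whose project this is."
[master 9f1c2ab] Clarify in the readme whose project this is.
 1 file changed, 1 insertion(+), 1 deletion(-)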
Exercise: Commit and Add a Message in One Step
1. Edit your readme file to tell us whose it is (e.g., “This is Dr.
Nu’s readme…”).
2. Add the file, commit the changes, and append your log mes‐
sage with one command.
Whichever method you use to write commit messages, be sure to make those mes‐
sages useful. That is, write commit messages as if they were the scientific notation
that they are. Like section headings on the pages of your lab notebook, these messages
should each be one or two sentences explaining the change represented by a commit.
The frequency of commits is the resolution of your undo button.
Committing frequently makes merging contributions and revers‐
ing changes less painful.
Now we have successfully taken a snapshot representing an incremental change in the
repository and provided a message to go along with it. The fundamental innovation
in version control is that a record of that work has now been kept. To view that
record, we use the log command.
git log: Viewing the History
A log of the commit messages is kept by the repository and can be reviewed with the
log command:
~/parity_code $ git log
commit cf2631a412f30138f66d75c2aec555bb00387af5
Author: Nouveau Nu <nu@university.edu>
Date:   Fri Jun 21 18:21:35 2013 -0500

    I have added details to the readme to describe the parity violation project.

commit e853a4ff6d450df7ce3279098cd300a45ca895c1
Author: Nouveau Nu <nu@university.edu>
Date:   Fri Jun 21 18:19:38 2013 -0500

    This is my first commit. I have added a readme file.
The log command prints the logged metadata for each commit.
Each commit possesses a unique (hashed) identification number that can be used
to refer to that commit.
The metadata from the configuration step is preserved along with the commit.
Git automatically records the date and time at which the commit occurred.
Finally, the log message for each commit is printed along with that commit.
As more changes are made to the files, more snapshots can be committed, and the log
will reflect each of those commits. After making a number of commits, Dr. Nu can
review the summary of her work in the form of these commit messages.
When she wants to review her work in more detail than the commit messages can
offer, she may want to review the actual changes that were made between certain com‐
mits. Such differences between file versions are called “diffs,” and the tools that dis‐
play them are diff tools.
Viewing the Differences (git diff)
Let’s recall the behavior of the diff command on the command line. Choosing two
files that are similar, the command:
diff <file1> <file2>
will output the lines that differ between the two files. This information can be saved
as what’s known as a patch, but we won’t go deeply into that just now. Suffice it to say
that there are many diff tools. Git, however, comes with its own diff system.
The only difference between the command-line diff tool and Git’s diff tool is that the
Git tool is aware of all of the revisions in your repository, allowing each revision of
each file to be treated as a full file.
Thus, git diff will output the changes in your working directory that are not yet
staged for a commit. If Dr. Nu adds a definition of parity to her readme file but has
not yet added or committed it, those changes are unstaged. When she asks Git for the diff, the fol‐
lowing occurs:
~/parity_code $ git diff
diff --git a/readme.rst b/readme.rst
index 28025a7..a5be27f 100644
--- a/readme.rst
+++ b/readme.rst
@@ -2,3 +2,5 @@ Welcome
This is my readme file for a project on parity violations in the standard
model.
+
+In the context of quantum physics, parity is a type of symmetric relation.
Dr. Nu executes the git diff command.
diff reports that the differences not staged for commit exist only in the readme
file versions a and b.
The index line shows the hashes (28025a7 and a5be27f) of the two versions of the
file being compared.
To see how this works, make a change in your readme.rst file, but don’t yet commit it.
Then, try git diff.
A summarized version of this output can be seen with the --stat flag:
~/parity_code $ git diff --stat
readme.rst | 2 ++
1 file changed, 2 insertions(+)
For each line where one or more characters have been added, an insertion is counted.
If characters are deleted from a line, this is called a deletion.
To see only what is staged for commit, you can try:
$ git diff --cached
What is the difference shown in the cached diff? What does this mean about what
files are staged?
Sometimes what you have staged is not what you actually want to commit. In the
same way, sometimes after reviewing a change that she has already committed, Dr.
Nu thinks better of it and would prefer to roll a file back to an earlier version. In both
of those instances, the git reset command can be used.
Unstaging or Reverting a File (git reset)
If, after reviewing the log, Dr. Nu decides that she prefers a past version of some file
to the previous revision, she can use git reset. This command can be used either to
unstage a staged file or to roll back a file or files to a previous revision.
If you added a file to the staging area that you didn’t mean to add, you can use reset
to “unstage” it (i.e., take it out of the staged set of commits):
git reset <filename>
In this case, reset acts like the opposite of add. However, reset has another use as
well. If you want to return the repository to a previous version, you can use reset for
that too. Just use the commit number (an example follows the list of mode flags below):
git reset [<mode>] [<commit>]
reset has some useful mode flags:
--soft
Leaves the contents of your files and repository index alone, but resets repository
head
--mixed
Resets the index and repository head, but not the contents of your files
--hard
Returns the contents of all files and the repository index to the commit specified
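For example, using the hash of Dr. Nu's first commit from the log shown earlier, a hard reset might look like the following (the exact output format varies by Git version, and note that this would discard her later readme changes from the working copy):

~/parity_code $ git reset --hard e853a4ff
HEAD is now at e853a4f This is my first commit. I have added a readme file.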
Using reset, you can therefore undo changes that have already been committed. For
changes that have not yet been committed, you can use git checkout to discard the
modifications in your working copy:
git checkout -- <filename>
Note that git checkout has other purposes, which we’ll see soon.
Exercise: Discard Modifications
1. Create five files in your directory, with one line of content in
each file.
2. Commit the files to the repository.
3. Change two of the five files and commit them.
4. Undo the changes in step 3.
5. Print out the last entry in the log.
Using reset or checkout, however, does not delete the commits permanently. The
record of those commits is still stored in the repository, and they can be accessed with
their commit revision hash numbers via git checkout. To undo a committed change
while keeping an explicit record of the reversal, there is git revert.
Discard Revisions (git revert)
Unlike git reset --hard, git revert does not rewind the repository. Instead, it is a help‐
ful tool when you want to back out a change that has already been committed—for
example, if you've accidentally committed something with private or proprietary
information—by recording a new commit that reverses the earlier one. (The reverted
commit itself remains in the history, so truly purging sensitive data requires rewriting
history with more drastic tools.) The syntax for git revert is:
git revert <commit>
While she was working on her readme file, Dr. Nu decided to add contact informa‐
tion for herself. In doing so, she committed the following change to the readme:
diff --git a/readme.rst b/readme.rst
index a5be27f..0a07497 100644
--- a/readme.rst
+++ b/readme.rst
@@ -1,5 +1,8 @@
Welcome
+To contact Dr. Nouveau Nu, please send an email to nu@university.edu or call
+her cell phone at 837-5309.
+
This is my readme file for a project on parity violations in the standard
model.
A few seconds after committing this change, she regretted making her cell phone
number available. She could edit the readme to remove the number. However, even
after she commits that change, her number can still be accessed with git checkout.
To undo that commit cleanly, she can use the revert command. First, she needs to
know what commit number to revert, so she uses the log command:
~/parity_code $ git log
commit fc06a890ecba5d16390a6fb4514cb5ba45546952
Author: Nouveau Nu <nu@university.edu>
Date:   Wed Dec 10 14:00:26 2014 -0800

    Added my email address and phone number
...
Here, Dr. Nu finds the hash number to use.
She can use the whole hash number or as few as the first eight characters to uniquely
identify the commit she would like to revert:
~/parity_code $ git revert fc06a890
[master 2a5b0e1] Revert "Added my email address and phone number"
1 file changed, 3 deletions(-)
The revert records a new commit that undoes the change, so her number no longer
appears in the readme; she can breathe a sigh of relief and move on. Now Dr. Nu
would like to start programming in seriousness. However,
she is concerned about something. Since science involves trying things out and mak‐
ing mistakes, will she have to spend a lot of her time rewinding changes like this to
remove them from the master branch? What if she wants to try two new things at
once? In the next section, we will see that the answer to both of these questions is
“Using branches will make everything easier.”
Listing, Creating, and Deleting Branches (git branch)
Branches are parallel instances of a repository that can be edited and version con‐
trolled in parallel. They are useful for pursuing various implementations experimen‐
tally or maintaining a stable core while developing separate sections of a code base.
Without an argument, the git branch command lists the branches that exist in your
repository:
~/parity_code $ git branch
* master
The “master” branch is created when the repository is initialized. This is the default
branch and is conventionally used to store a clean master version of the source code.
With an argument, the branch command creates a new branch with the given name.
Dr. Nu would like to start a branch to hold some experimental code—that is, some
code that she is just trying out:
~/parity_code $ git branch experimetal
~/parity_code $ git branch
experimetal
* master
She creates a branch called “experimetal.”
To check that this worked, she lists the branches using the branch command.
The asterisk indicates which branch she is currently in. We’ll demonstrate how to
change branches shortly.
Whoops—Dr. Nu forgot to type the n in experimental. Simple typos like this happen
all the time in programming, but they are nothing to fear. Very few typos will break
everything. In this case, deleting the branch and trying again is very simple. To delete
a branch, she can use the -d flag:
~/parity_code $ git branch -d experimetal
Deleted branch experimetal (was 2a5b0e1).
~/parity_code $ git branch
* master
~/parity_code $ git branch experimental
~/parity_code $ git branch
experimental
* master
She deletes the misspelled branch.
Git responds that, yes, it has been deleted (and provides the hash number for the
HEAD, or most recent commit, of that branch).
Just to double-check, she can list the branches again.
And voilà, it’s gone. She can try again—this time without the typo.
At this point, Dr. Nu has created the “experimental” branch. However, she is still cur‐
rently working in the master branch. To tell Git that she would like to work in the
experimental branch, she must switch over to it with the checkout command.
Switching Between Branches (git checkout)
The git checkout command allows context switching between branches as well as
abandoning local changes and viewing previous commits.
To switch between branches, Dr. Nu can “check out” that branch:
~/parity_code $ git checkout experimental
Switched to branch 'experimental'
~/parity_code $ git branch
* experimental
master
Git is actually very good at keeping the user well informed. It can be very
reassuring.
How can you tell when you’ve switched between branches? When we used the branch
command before, there was an asterisk next to the master branch. Now it’s next to the
experimental branch—the asterisk indicates which branch you’re currently in.
Now, Dr. Nu can safely work on code in the experimental branch. When she makes
commits, they are saved in the history of the experimental branch, but are not saved
in the history of the master branch. If the idea is a dead end, she can delete the exper‐
imental branch without polluting the history of the master branch. If the idea is good,
however, she can decide that the commit history made in the experimental branch
should be incorporated into the master branch. For this, she will use a command
called merge.
Merging Branches (git merge)
At some point, the experimental branch may be ready to become part of the master.
The method for combining the changes in two parallel branches is the merge com‐
mand. To merge the changes from the experimental branch into the master, Dr. Nu
executes the merge command from the master branch:
~/parity_code $ git checkout master
~/parity_code $ git merge experimental
Now, the logs in the master branch should include all commits from each branch.
Give it a try yourself with the following long exercise.
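One quick way to confirm the result is to view the combined history (a sketch; the
--oneline and --graph flags are standard git log options not otherwise used in this
chapter):
~/parity_code $ git log --oneline --graph
This prints one line per commit and draws the branch structure, so the commits made
on the experimental branch should now appear in the master branch's history.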
Exercise: Create Two New Branches
1. Create a new repository and commit an empty readme file:
~ $ mkdir geography
~ $ cd geography
~/geography $ git init
~/geography $ touch readme.rst
~/geography $ git add readme.rst
~/geography $ git commit -am "first commit"
2. Create two new branches and list them:
~/geography $ git branch us
~/geography $ git branch texas
3. Add files describing each entity. In the “us” branch, include at
least a file called president. For “texas,” of course, you’ll need a
file called governor. You’ll probably also want one called flower:
~/geography $ git checkout us
Switched to branch 'us'
~/geography $ touch president
~/geography $ git add president
~/geography $ git commit -am "Added president to the us branch."
~/geography $ git checkout texas
Switched to branch 'texas'
~/geography $ touch flower
~/geography $ git add flower
~/geography $ git commit -am "Added bluebonnets to the texas branch."
4. Merge the two branches into the master branch:
~/geography $ git checkout master
Switched to branch 'master'
~/geography $ git merge texas
Updating d09dfb9..8ce09f1
Fast-forward
flower | 0
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 flower
~/geography $ git merge us
Merge made by the 'recursive' strategy.
president | 0
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 president
The ability to automatically merge commits is powerful and quite superior to having
multiple versions of directories cluttering your filesystem. That said, the merge com‐
mand is only capable of combining changes that do not conflict with one another. In
the next section, you’ll get a taste of this problem.
Dealing with Conflicts
Both Texas and the United States have a national anthem. However, we notice that the
national anthem isn’t there, so we add a file called national_anthem to the “us”
branch:
~/geography $ git checkout us
~/geography $ echo "Star-Spangled Banner" > national_anthem
~/geography $ git add national_anthem
~/geography $ git commit -am "Added star spangled banner to the us branch."
Next, of course, we put on our Wranglers and Stetsons and do the same for the
“Texas” branch, which does not yet have a national anthem file.
~/geography $ git checkout texas
~/geography $ echo "Texas, Our Texas" > national_anthem
~/geography $ git add national_anthem
~/geography $ git commit -am "Added Texas, Our Texas to the texas branch."
If we merge them into one another or into the master branch, what happens?
What happens is a conflict. This is a common issue when two different people are
working independently on different branches of a repository and try to merge them.
Since that is the context in which conflicts are most commonly encountered, the
explanation of how to deal with conflicts will be addressed in the next chapter. For
now, abort the merge with git merge --abort.
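As a brief illustration (a sketch only), one way to see the conflict and then back out
of it is:
~/geography $ git checkout texas
~/geography $ git merge us
~/geography $ git merge --abort
The merge command will stop and report a conflict in national_anthem, since both
branches added that file with different contents; the --abort flag then returns the
branch to the state it was in before the merge was attempted.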
Version Control Wrap-Up
In this chapter, we have shown how to use git for recording versions of files, rewind‐
ing changes, and merging independent changes. These are the first steps toward
reproducible scientific computation. Having read this chapter, you are now prepared
to go forth and version control the work you do day to day. In fact, take a moment
now to place those analysis scripts you are working on under version control. Since
Git is an expansive and complex tool, you may find a need for additional resources.
We can recommend, in particular:
• Pro Git Book
• Software Carpentry’s Git Lessons
• The Software Carpentry quick reference
Now that you are comfortable with managing versions of files and source code
locally, you can move forward and program reproducibly. Next, you will need to
know how to harness the power of Git for collaboration. The next chapter will cover
the use of Git in combination with the immense power of the Internet.
CHAPTER 16
Remote Version Control
Now that you have learned how to version files locally with Git, you are ready to rev‐
olutionize the way you collaborate on software, papers, data, and everything else. This
chapter will cover the immense power of Git when it is combined with the broad
reach of the Internet.
Chapter 15 described tasks related to the local working copy of your repository. How‐
ever, the changes you make in this local copy aren’t backed up online automatically.
Until you send those changes to the Internet, the changes you make are local changes.
This chapter will discuss syncing your local working copy with remote copies online
and on other computers on your network. In particular, this chapter will explain how
to use Git and the Internet for:
• Backing up your code online
• Forking remote repositories to enable collaboration
• Managing files in a collaboration
• Merging simultaneous changes
• Downloading open source code to keep track of updates
First among these, this chapter will cover backing up code online.
Repository Hosting (github.com)
Repositories can be stored and accessed through repository hosting servers online.
Many people store their source code repositories on common repository hosting
services such as:
• Launchpad
• Bitbucket
• Google Code
• SourceForge
• GitHub
This chapter will use GitHub as an example. It provides tools for browsing, collabo‐
rating on, and documenting code. These include:
• Landing page support
• Wiki support
• Network graphs and time histories of commits
• Code browser with syntax highlighting
• Issue (ticket) tracking
• User downloads
• Varying permissions for various groups of users
• Commit-triggered mailing lists
• Other service hooks (e.g., Twitter)
These services allow anyone with a repository to back up their work online and
optionally share it with others. They can choose for it to be either open source or pri‐
vate. Your home institution may have a repository hosting system of its own. To find
out, ask your system administrator.
Setting up a repository on GitHub requires a GitHub username
and password. Please take a moment to create a free GitHub
account.
Additionally, you may find it helpful to set up SSH keys for auto‐
matic authentication.
Dr. Nu can use GitHub as a way to back up her parity work, share it with her graduate
students, and demonstrate its fidelity to paper reviewers. Since her parity_code simu‐
lation software from the previous chapter already exists, she can upload it to GitHub
in four simple steps:
1. Create a user account on GitHub.
2. Create a space for her repository on GitHub.
3. Point to that remote from the local copy.
4. Push her repository to that location.
The first two of these steps occur within the GitHub interface online.
Creating a Repository on GitHub
Setting up a user account creates a space for a user like Dr. Nu to collect all of the
repositories she uses. Creating a repository names a location within that space for a
certain piece of software.
When she creates a username (NouveauNu) on GitHub, a location on its servers is
reserved for her at github.com/NouveauNu. If she navigates in her browser to that
location, she can click a big green button that says “New Repository.” She can supply
the repository name parity_code, and GitHub will respond by creating an empty
space at github.com/NouveauNu/parity_code.
This location is called a “remote” location because it is distant from the local working
copy. Now that the repository location has been created, Git can be used to send com‐
mits from the local copy to the remote. For this, Dr. Nu needs to alert the local copy
to the existence of the remote.
Declaring a Remote (git remote)
Remote repositories are just like local repositories, except they are stored online. To
synchronize changes between the local repository and a remote repository, the loca‐
tion of the remote must be registered with the local repository.
The git remote command allows the user to register remote repository URLs under
shorter aliases. In particular, this command can be used to add, name, rename, list,
and delete remote repository aliases. The original remote repository, with which a
local copy is meant to synchronize, is called “origin” by convention. In our example,
this repository is the one where Dr. Nu holds the master copy of parity_code. So, from
Dr. Nu’s local working copy of the parity_code repository, she creates an alias to the
remote thusly:
$ git remote add origin https://github.com/NouveauNu/parity_code.git
The git remote command declares that Git should register something about a
remote repository. The add subcommand declares that a remote repository alias
should be added.
Dr. Nu chooses to use the conventional alias for this repository, origin. She then
associates the alias with the online location of the repository. This URL can be
copied from the GitHub page holding this repository.
Once she has executed this command, her local repository is now synced with the
one online. She is then capable of sending and receiving commits from the remote
repository. She can see a list of remotes registered with the remote command and its
“verbose” flag:
~/parity_code $ git remote -v
origin https://github.com/NouveauNu/parity_code.git (fetch)
origin https://github.com/NouveauNu/parity_code.git (push)
The -v flag is common and means “verbose.” In this case, it means “verbosely list
the remotes.”
The origin alias is associated with the URL Dr. Nu provided. The meanings of
fetch and push will be covered very shortly.
She can now use this remote alias to “push” a full copy of the current status of the
parity_code repository onto the Internet.
Sending Commits to Remote Repositories (git push)
The git push command pushes commits in a local working copy to a remote reposi‐
tory. The syntax is:
git push [options] <remote> <branch>
To push a copy of her parity_code repository up to the Internet, Dr. Nu can therefore
execute the command:
~/parity_code (master) $ git push origin master
Username for 'https://github.com': nouveaunu
Password for 'https://nouveaunu@github.com':
Counting objects: 22, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (15/15), done.
Writing objects: 100% (22/22), 2.19 KiB | 0 bytes/s, done.
Total 22 (delta 3), reused 0 (delta 0)
To https://github.com/NouveauNu/parity_code
 * [new branch]      master -> master
Dr. Nu pushes the current (master) branch up to the origin remote.
GitHub requires a username.
And a password.
The master branch has been pushed online.
This sends the full history of the local master branch up to the “origin” remote on
GitHub, as seen in Figure 16-1. For security, the GitHub servers ask Dr. Nu for her
username and password before the push is accepted. Only users with the appropriate
permissions can push changes to this repository. In this case, Dr. Nu is the only user
with permission to do so.
Figure 16-1. Pushing to a remote
To access the files that are now online, Dr. Nu can navigate in a browser to the loca‐
tion of that repository online. Indeed, so can her collaborators. This is where the
magic begins. Since Dr. Nu has collaborators at other universities who rely on her
software to do their analysis, GitHub can be very helpful for sharing that software
with them. In particular, Fran Faraway, a postdoc on another continent, can now keep
up to date without any emails or phone calls to Dr. Nu. Now that Dr. Nu’s code is
online, Fran can use Git to download it from that location on GitHub using the clone
command.
Downloading a Repository (git clone)
Like Dr. Nu’s parity code, many useful open source scientific software libraries are
kept in repositories online. With the help of Git, scientists relying on these scientific
libraries can acquire up-to-date source code for their use and modification. The best
way to download such a repository is to clone it, as illustrated in Figure 16-2.
When a repository is cloned, a local copy is created on the local computer. It will
behave as a fully fledged local repository where local branches can be created, edits
can be made, and changes can be committed. To clone the parity_code repository,
Fran can use the syntax:
~/useful_software $ git clone https://github.com/NouveauNu/parity_code.git
This command downloads the online repository into a directory called parity_code.
Figure 16-2. Cloning a repository
Exercise: Clone a Repository from GitHub
1. Pick any repository you like. There are many cool projects
hosted on GitHub. Take a few minutes here to browse GitHub
and pick a piece of code of interest to you.
2. Clone it. If you didn’t find anything cool, we can suggest clon‐
ing the AstroPy libraries:
~ $ git clone https://github.com/astropy/astropy.git
Cloning into astropy...
remote: Counting objects: 24, done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 24 (delta 7), reused 17 (delta 1)
Receiving objects: 100% (24/24), 74.36 KiB, done.
Resolving deltas: 100% (7/7), done.
3. You should see many files download themselves onto your
machine. These files will have been placed in a directory with
the name of the repository. Let’s make sure it worked. Change
directories, and list the contents:
~/ $ cd astropy
~/ $ ls
Now that she has cloned Dr. Nu’s repository, Fran has a full copy of its history. Fran
can even edit the source code for her own purposes and push that to her own reposi‐
tory online without affecting Dr. Nu whatsoever. An example of this process is shown
in Figure 16-3.
Figure 16-3. Creating new remotes
In Figure 16-3, Fran has pushed her own changes up to her own GitHub repository.
Sometimes, when it is initialized in a particular way, this is called a “fork” of the origi‐
nal repository. Forks are a GitHub notion, rather than a Git notion. Basically, they are
mutually aware remote repositories. To create them, simply locate the Fork button at
the top-righthand corner of any repository on GitHub. That creates a new repository
in your user space from which you can clone your work. A project managed in this
way between the Curies might have the structure demonstrated in Figure 16-4.
Figure 16-4. Forks in the Curie Pitchblende collaboration
This distributed system of remotes scales nicely for larger collaborations.
Since Dr. Nu’s code is under active development, Fran must update her own local
repository when Dr. Nu makes improvements to the code. By default, Fran’s cloned
repository is configured to be easily updated: the origin remote alias is registered by
default and points to the cloned URL. To download and incorporate new changes
from this remote repository (origin), Fran will require the git fetch and git merge
commands.
Exercise: Fork the GitHub Repository
While you may already have a copy of this repository, GitHub
doesn’t know about it until you’ve made a fork. You’ll need to tell
GitHub you want to have an official fork of this repository.
1. Go to github.com/nouveaunu/parity_code in your Internet
browser, and click on the Fork button.
2. Clone it. From your terminal:
$ git clone https://github.com/<YOU>/parity_code.git
$ cd parity_code
In the place of <YOU>, put your actual GitHub username.
3. Now, create an alias for the remote repository:
$ git remote add nu \
    https://github.com/nouveaunu/parity_code.git
$ git remote -v
origin  https://github.com/YOU/parity_code (fetch)
origin  https://github.com/YOU/parity_code (push)
nu      https://github.com/nouveaunu/parity_code (fetch)
nu      https://github.com/nouveaunu/parity_code (push)
Create a remote alias called nu that points at the original
repository.
List the remotes to see the effect.
The origin remote is set by default during the cloning
step.
Fetching the Contents of a Remote (git fetch)
Since the cloned repository has a remote that points to Dr. Nu’s online repository, Git
is able to fetch information from that remote. Namely, the git fetch command can
retrieve new commits from the online repository. In this case, if Fran wants to
retrieve changes made to the original repository, she can git fetch updates with the
command:
~/useful_software/parity_code $ git fetch origin
The fetch command merely pulls down information about recent changes from the
original master (origin) repository. By itself, the fetch command does not change
Fran’s local working copy. To actually merge these changes into her local working
copy, she needs to use the git merge command.
Merging the Contents of a Remote (git merge)
To incorporate upstream changes from the original master repository (in this case,
NouveauNu/parity_code) into her local working copy, Fran must both fetch and
merge. If Fran has made many local changes and commits, the process of merging
may result in conflicts, so she must pay close attention to any error messages. This is
where version control is very powerful, but can also be complex.
Exercise: Fetch and Merge the Contents of a GitHub Repository
1. In the repository you cloned, fetch the recent remote reposi‐
tory history:
$ git fetch origin
2. Merge the origin master branch into your master branch:
$ git merge origin/master
3. Find out what happened by browsing the directory.
This process of fetching and merging should be undertaken any time a repository
needs to be brought up to date with a remote. For brevity, both of these steps can be
achieved at once with the command git pull.
Pull = Fetch and Merge (git pull)
The git pull command is equivalent to executing git fetch followed by git
merge. Though it is not recommended for cases in which there are many branches to
consider, the pull command is shorter and simpler than fetching and merging as it
automates the branch matching. Specifically, to perform the same task as we did in
the previous exercise, the pull command would be:
$ git pull origin master
Already up-to-date.
When there have been remote changes, the pull will apply those changes to your local
branch. It may require conflict resolution if there are conflicts with your local
changes.
When Dr. Nu makes changes to her local repository and pushes them online, Fran
must update her local copy. She should do this especially if she intends to contribute
back to the upstream repository and particularly before making or committing any
changes. This will ensure Fran is working with the most up-to-date version of the
repository:
~/useful_software/parity_code $ git pull
Already up-to-date.
The “Already up-to-date” response indicates that no new changes need to be added.
That is, there have not been any commits to the original repository (origin) since the
most recent update.
Conflicts
If Dr. Nu and Fran Faraway make changes in different files or on different lines of the
same file, Git can merge these changes automatically. However, if for some reason
they make different changes on the same line of a certain file, it is not possible for the
merge (or pull) command to proceed automatically. A conflict error message will
appear when Fran tries to merge in the changes.
This is the trickiest part of version control, so let’s take it very carefully.
In the parity_code repository, you’ll find a file called readme.rst. This is a standard
documentation file that appears rendered on the landing page for the repository in
GitHub. To see the rendered version, visit your fork on GitHub. The first line of this
file is “Welcome.”
For illustration, let’s imagine that both Dr. Nu and Fran Faraway suddenly decide they
would like to welcome visitors in the tongue of their home nation. Since these two
collaborators are from two different places, there will certainly be disagreements
about what to say instead of “Welcome.” This will cause a conflict.
First, since Dr. Nu is French, she alters, commits, and pushes the file such that it
reads:
Bonjour
This is my readme file for a project on parity violations in the standard
model.
In the context of quantum physics, parity is a type of symmetric relation.
Fran, however, is from Texas, so she commits her own version of “Welcome.”
Howdy
This is my readme file for a project on parity violations in the standard
model.
In the context of quantum physics, parity is a type of symmetric relation.
Before pushing her change to her own remote, Fran updates her repository to include
any changes made by Dr. Nu. The result is a conflict:
~/useful_software/parity_code $ git merge origin
Auto-merging readme.rst
CONFLICT (content): Merge conflict in readme.rst
Automatic merge failed; fix conflicts and then commit the result.
Since the two branches have been edited on the same line, Git does not have an algo‐
rithm to merge the changes correctly.
Resolving Conflicts
Now what?
Git has paused the merge. Fran can see this with the git status command:
$ git status
# On branch master
# Unmerged paths:
#   (use "git add/rm <file>..." as appropriate to mark resolution)
#
#       unmerged:   readme.rst
#
no changes added to commit (use "git add" and/or "git commit -a")
The only thing that has changed is the readme.rst file. Opening it, Fran sees
something like this:
<<<<<<< HEAD
Howdy
=======
Bonjour
>>>>>>> master
This is my readme file for a project on parity violations in the standard
model.
In the context of quantum physics, parity is a type of symmetric relation.
Git has added this line to mark where the conflict begins. It should be deleted
before a resolution commit.
The change that Fran committed.
Git has added this line to mark the separation between the two conflicting ver‐
sions. It should be deleted before a resolution commit.
The change that Dr. Nu committed.
Git has added this line to mark where the conflict ends. It should be deleted
before a resolution commit.
The intent is for Fran to edit the file intelligently and commit the result. Any changes
that Fran commits at this point will be accepted as a resolution to this conflict.
Fran knows now that Dr. Nu wanted the “Welcome” to say “Bonjour.” However, Fran
also wants it to say “Howdy,” so she should come up with a compromise. First, she
should delete the marker lines. Since she wants to be inclusive, she then decides to
change the line to include both greetings. Decisions such as this one must be made by
a human, and are why conflict resolution is not handled more automatically by the
version control system:
Howdy and Bonjour
This is my readme file for a project on parity violations in the standard
model.
In the context of quantum physics, parity is a type of symmetric relation.
This results in the status:
$ git status
# On branch master
# Unmerged paths:
#   (use "git add/rm <file>..." as appropriate to mark resolution)
#
#       both modified:      readme.rst
#
...
no changes added to commit (use "git add" and/or "git commit -a")
Now, to alert Git that she has made appropriate alterations, Fran follows the instruc‐
tions it gave her in the status message (namely, git add and git commit those
changes):
$ git commit -am "Compromises merge conflict to Howdy and Bonjour"
$ git push origin master
Counting objects: 10, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 762 bytes, done.
Total 6 (delta 2), reused 0 (delta 0)
To https://github.com/username/repositoryname.git
Explain your solution to the merge conflict in the log message.
Push the results online.
And that is it. Now the repository contains the history of both conflicting commits as
well as a new commit that merges them intelligently. The final result is the version
Fran has just committed.
Remote Version Control Wrap-up
In this chapter, we have shown how to use Git along with GitHub for downloading,
uploading, and collaborating on code. Combined with remote repository hosting
sites, the skills learned in Chapter 15 allow the scientist to manage files and change
sets, merge simultaneous work among collaborators, and publish that work on the
Internet.
Having read this chapter, you are now prepared to open your code to your collabora‐
tors. Go forth and version control the work you do day to day. In fact, take a moment
now to place those analysis scripts you are working on under version control. Since
it’s an expansive and complex tool, you may find a need for additional resources on
Git. We can recommend, in particular:
• Pro Git book (Apress), a free and open source ebook
• Software Carpentry’s Version Control with Git
• Software Carpentry’s Git Reference
Now that you are comfortable with pulling, pushing, fetching, merging, and dealing
with conflicts, you should be able to collaborate with your colleagues on code and
papers more smoothly and reproducibly. Next, you will need to know how to find
and fix bugs so that your collaborative, reproducible software is reproducibly correct.
CHAPTER 17
Debugging
The scientific method’s central motivation is the ubiquity of error—the awareness that
mistakes and self-delusion can creep in absolutely anywhere and that the scientist’s
effort is primarily expended in recognizing and rooting out error.
—Donoho 2009
In the very early days of computing, Admiral Grace Hopper and her team on the
Mark II computer encountered errors in the performance of the computer. Ulti‐
mately, a moth was discovered in one of the relays. Admiral Hopper reportedly
remarked that they were “debugging” the system. Though the term had been used
before in engineering, this event popularized the terms bug and debugging for the
causes and solutions, respectively, of errors in computer code and performance.
Bugs are errors in code, and they are ubiquitous reminders of our humanity. That is,
computers, by their very nature, do exactly what we tell them to do. Therefore, bugs
are typically imperfections in syntax and logic introduced by humans. However care‐
ful we are, bugs will be introduced while we are developing code. They begin to be
introduced as soon as we start writing a piece of code. For this reason, we must be
vigilant, and we must be prepared to fix them when they arise. This chapter will pre‐
pare you to recognize, diagnose, and fix bugs using various tools and methods for
“debugging” your code. It will do so by introducing:
• When, how, and by whom bugs are encountered
• Methods of diagnosing bugs
• Interactive debugging, for diagnosing bugs quickly and systematically
• Profiling tools to quickly identify memory management issues
• Linting tools to catch style inconsistencies and typos
Of these, the most time will be spent on using the pdb interactive debugger in Python,
since it is essential for debugging issues whose source is not obvious from tests or
simple print statements. First, however, we will discuss the instigating event itself:
encountering a bug.
Encountering a Bug
A bug may take the form of incorrect syntax, imperfect logic, an infinite loop, poor
memory management, failure to initialize a variable, user error, or myriad other
human mistakes. It may materialize as:
• An unexpected error while compiling the code
• An unexpected error message while running the code
• An unhandled exception from a linked library
• An incorrect result
• An indefinite pause or hang-up
• A full computer crash
• A segmentation fault
• Silent failure
Developers may encounter bugs that they or their colleagues have introduced in their
source code. Users, also, test the limits of a program when they deploy it for their own
uses. Irrespective of when a bug is found or who finds it, bugs must be fixed. Faced
with any of them, whether it be a build error, a runtime exception, or a segmentation
fault, the user or developer should first try to track down and fix the bug.
Bugs can also be encountered at any time. In the next chapter, we will explain how
testing should do most of the verification work in a piece of scientific software. That
is, most bugs in a well-developed piece of software should be found by the tests
before they make it to the trusted version of the code.
A well-written test suite should do the heavy lifting in finding and
diagnosing bugs. Interactive debugging lifts the remainder.
However, tests rarely cover all edge and corner cases, so bugs can slip through the
cracks. The longer a bug exists undetected in a piece of trusted software, the more
dire the situation:
1. If a bug is found in testing, it can be fixed before the software is ever used.
2. If a bug is found before there are users, it can be fixed before it affects anyone
running the code.
3. If a bug is found when the code is run, it can be fixed before analysis is done on
the results.
4. If a bug is found when the results of the code are analyzed, it can be fixed before
the results are published in a journal article.
5. If a bug is found after the results are published, the paper has to be retracted.
Many papers are retracted every year due to bugs in code. Chapter 18 will show you
how to improve your software tests to avoid getting past stage 1 in the preceding list.
However, bugs occasionally slip through the cracks, so you must be ready to
encounter and fix them at any stage of your project.
While bugs can be found by anyone, they are usually only diagnosable and fixable by
people who know what the code is meant to do. Without knowing what the code is
meant to do, it is nearly impossible to know when a result is incorrect, when a long
pause is suspicious, or when silent termination indicates failure.
Now that you know how a bug is encountered, you are ready to learn how to diag‐
nose its cause. In order to walk before we run, we will first introduce a simplistic way
to diagnose bugs in code: print statements.
Print Statements
Print statements are every developer’s first debugger. Because of this, we’ll start here
—but know that they are not the best practice for effective computing and that we
will be covering better methods later in the chapter. Printing is typically a check that
asks one or both of these questions:
• Is the bug happening before a certain line?
• What is the status of some variable at that point?
In a simple, buggy program, a print statement can answer the first question if it is
inserted at a place where the program is suspected of misbehavior.
In Python 3, print(x) is a function call, and thus an expression. In
Python 2, print x was a statement and used a slightly different syn‐
tax. Still, the term print statement is used across many program‐
ming languages, even when printing to the screen is not technically
a statement on its own (like in Python 3, C, and C++).
In the following example, something about the code is causing it to “hang.” That is, it
simply seems to run forever, as if stalled:
def mean(nums):
    bot = len(nums)
    it = 0
    top = 0
    while it < len(nums):
        top += nums[it]
    return float(top) / float(bot)

if __name__ == "__main__":
    a_list = [1, 2, 3, 4, 5, 6, 10, "one hundred"]
    mean(a_list)
It is likely that you can determine the cause of this problem by visual inspection.
However, in the case that you cannot, a print statement can be inserted where the
code is suspected to be hanging:
def mean(nums):
    bot = len(nums)
    it = 0
    top = 0
    print("Still Running at line 5")
    while it < len(nums):
        top += nums[it]
        print(top)
    return float(top) / float(bot)

if __name__ == "__main__":
    a_list = [1, 2, 3, 4, 5, 6, 10, "one hundred"]
    mean(a_list)
This print() is added to determine where the error is happening.
This one is added to determine what is happening to the variables during the
loop.
Once a print statement is inserted at the suspected point of misbehavior, the program
can be executed, and the print statement either appears before the exception, or does
not appear at all. In the case shown here, many things are wrong with the code. How‐
ever, the most fatal is the infinite while loop. Since the print statement appears before
the code enters the infinite loop, the troublemaking line must be after the print state‐
ment. In this case, the first print statement is printed, so it is clear that the error
occurs after line 5. Additionally, the second print statement results in “1” being
printed infinitely. Can you tell what is wrong in the code using this information? The
infinite loop can certainly be fixed as in the code shown here:
def mean(nums):
    top = sum(nums)
    bot = len(nums)
    return float(top) / float(bot)

if __name__ == "__main__":
    a_list = [1, 2, 3, 4, 5, 6, 10, "one hundred"]
    mean(a_list)
Rather than looping needlessly, the sum() function can be applied to the list.
Print statements like this can provide very helpful information for pinpointing an
error, but this strategy does not scale. In a large code base, it usually takes more than
a few tries or a few more print statements to determine the exact line at which the
error occurred. Additionally, the number of potentially problem-causing variables
increases as the size of the code base increases. A more scalable solution is needed:
interactive debugging.
Interactive Debugging
Rather than littering one’s code base with print statements, interactive debuggers
allow the user to pause during execution and jump into the code at a certain line of
execution. Interactive debuggers, as their name suggests, allow the developer to query
the state of the code in an interactive way. They allow the developer to move forward
through the code execution to determine the source of the error.
Interactive debugging tools generally enable the user to:
• Query the values of variables
• Alter the values of variables
• Call functions
• Do minor calculations
• Step line by line through the call stack
All of this can help the developer to determine the cause of unexpected behavior, and
these features make interactive debuggers an excellent exploratory tool if used sys‐
tematically and intentionally. That is, without a strong notion of the expected behav‐
ior or a consistent plan of action, an interactive debugger only enables a developer to
attempt random changes to the code and to query variables at random, just hoping
for a change in behavior. This strategy is inefficient and error-prone. There are ways
to ensure that you are being systematic:
• Before investigating a line of code, ask yourself how the error could be caused by
that part of the code.
• Before querying the value of a variable, determine what you expect the correct
value to be.
• Before changing the value of a variable, consider what the effect of that change
should be.
• Before stepping forward through the execution, make an educated guess about
what will indicate an error or success.
• Keep track of the things you try and the changes you make. Use version control
to track changes to the files and use a pen and paper to track leads followed.
Now that you have some rules to live by, we can get started debugging interactively.
The next sections of this chapter will cover an interactive debugger in the Python
standard library, pdb.
Debugging in Python (pdb)
For Python code, interactive debugging can be achieved with the Python Debugger
(pdb). It provides an interactive prompt where code execution can be paused with a
trace and subsequent breakpoints. Then, the state of the code and variables can be
queried, stepped through line by line, restarted, and modified. This section will
describe all of these in the context of the still-failing mean code from the previous
example.
That is, even though we have fixed the infinite loop in the previous example, another,
different error arises.
Running
$ python a_list_mean.py
returns
Traceback (most recent call last):
  File "a_list_mean.py", line 9, in <module>
    mean(a_list)
  File "a_list_mean.py", line 2, in mean
    top = sum(nums)
TypeError: unsupported operand type(s) for +: 'int' and 'str'
There is still some kind of error. It looks like it has to do with the types of the
values in the list. Maybe we can use the debugger to check whether changing the
non-integer list value to a number will resolve the error.
To diagnose this error with pdb, we must first import the pdb module into the script:
import pdb

def mean(nums):
    top = sum(nums)
    bot = len(nums)
    return float(top) / float(bot)

if __name__ == "__main__":
    a_list = [1, 2, 3, 4, 5, 6, 10, "one hundred"]
    mean(a_list)
Import pdb into the file containing the suspiciously buggy code.
We must make one more edit to the file in order to begin. To tell the debugger where
in the source code we would like to “jump into” the execution, we must set a trace.
Setting the Trace
Rather than inserting a new print statement on a new line every time new informa‐
tion is uncovered, you can set a trace point at the line where you would like to enter
the program interactively in the debugger. You do so by inserting the following line
into the source code:
pdb.set_trace()
This trace pauses the execution of the program at the line where it appears. When the
program is paused, pdb provides an interface through which the user can type pdb
commands that control execution. Using these commands, the user can print the
state of any variable that is in scope at that point, step further forward through the
execution one line at a time, or change the state of those variables.
Exercise: Set a Trace
1. Create a file containing the buggy mean code.
2. Import pdb in that file.
3. Decide where you would like to set a trace and add a line there
that reads pdb.set_trace().
4. Save the file. If you try running it, what happens?
In the mean script, an appropriate trace point to set might be at the very beginning of
execution. It is a short program, and starting at the beginning will cover all the bases:
import pdb

def mean(nums):
    top = sum(nums)
    bot = len(nums)
    return float(top) / float(bot)

if __name__ == "__main__":
    pdb.set_trace()
    a_list = [1, 2, 3, 4, 5, 6, 10, "one hundred"]
    mean(a_list)
The trace point is set at the beginning of execution.
Now, when the script is run, the Python debugger starts up and drops us into the
code execution at that line.
Running python a_list_mean.py returns:
> /filespace/users/h/hopper/bugs/a_list_mean.py(10)<module>()
-> a_list = [1, 2, 3, 4, 5, 6, 10, "one hundred"]
(Pdb)
The pdb prompt looks like (Pdb). This is where you enter debugging commands.
Since the location of the trace was set before anything happens at all in the program,
the only object in scope is the definition of the mean() function. The next line initial‐
izes the a_list object. If we were to step forward through the execution, we would
expect to see that happen. The interactive debugger enables us to do just that.
Stepping Forward
In any interactive debugger, once a trace point is reached, we can explore further by
stepping slowly forward through the lines of the program. This is equivalent to
adding a print statement at each line of the program execution, but takes much less
time and is far more elegant.
The first time using a tool, you should find out how to get help. In pdb, typing help
provides a table of available commands. Can you guess what some of them do?
Documented commands (type help <topic>):
========================================
EOF    bt         cont      enable  jump  pp       run      unt
a      c          continue  exit    l     q        s        until
alias  cl         d         h       list  quit     step     up
args   clear      debug     help    n     r        tbreak   w
b      commands   disable   ignore  next  restart  u        whatis
break  condition  down      j       p     return   unalias  where

Miscellaneous help topics:
==========================
exec  pdb

Undocumented commands:
======================
retval  rv
To move forward through the code, for example, we would use the command step.
Note also the s command listed above step. This is a shorthand for the step function.
Either s or step can be used to move forward through execution one step.
Exercise: Step Through the Execution
1. Run your script from the last exercise.
2. Determine the expected effects of stepping through the execu‐
tion by one line.
3. Type s. What just happened?
After the step, the program state is paused again. Any variables in scope at that line
are available to be queried. Now that we have stepped forward one line, the a_list
object should be initialized. To determine whether that is truly the case when the code
is run, and whether a_list has been assigned the list that we expect, we can use pdb
to print the value of the a_list variable that is suspicious.
Querying Variables
Since valid Python is valid in the pdb interpreter, simply typing the name of the vari‐
able will cause pdb to print its value (alternatively, the print function could be used):
Code            Returns
(Pdb) s         > /filespace/users/h/hopper/bugs/a_list_mean.py(10)<module>()
                -> mean(a_list)
(Pdb) a_list    [1, 2, 3, 4, 5, 6, 10, 'one hundred']
Now, while it is clear that the variable is being set to the value we expect, it is suspect.
If you recall, the error we received involved a type mismatch during the summation
step. The string value one hundred may not be a valid input for the summation func‐
tion. If we can change the value of that element to an int, it may be a more valid
input for the summation. To test this with the debugger, we will need to execute a
command that resets the value of the last element of a_list. Then, if we continue the
execution of the code, we should see the summation function succeed.
Now, how do we change the last element of a_list while we are in pdb?
Setting the State
Since we have a guess about what the variable should be at this point, we can make
that happen in the interactive debugger with simple interactive Python. Just as we
could in the Python interpreter, we can set the value of the last element to 100 with
a_list[-1]=100:
Code                      Returns
(Pdb) a_list[-1] = 100
(Pdb) a_list              [1, 2, 3, 4, 5, 6, 10, 100]
Excellent. That was easy! Now that the program should be in a state that will not
crash the summation function, we should check that the summation function works.
How do we execute functions within the debugger?
Running Functions and Methods
In addition to variables, all functions and methods that are in scope at the breakpoint
are also available to be run within the debugging environment. So, just as we would in
the Python interpreter, we can execute sum(a_list):
Code                 Returns
(Pdb) sum(a_list)    131
It turns out that our initial hunch was correct. Changing the string version (one
hundred) to the integer version (100) allowed the summation function not to choke. Now
we would like to tell pdb to continue the execution in order to see whether our
change allows the program to run through to its finish without error. How do we
continue the execution?
Continuing the Execution
Rather than stepping through the rest of the code one line at a time, we can continue
the execution through to the end with the continue command. The shorthand for
this command is c. If the execution succeeds, we, the developers, will know that
changing the code in the Python script will solve our problem.
Exercise: Continue the Execution to Success
1. Run the script from the previous exercise.
2. Step forward one line.
3. Change one hundred to 100 in a_list.
4. Continue execution with c. What happened? Was the mean of
the list printed correctly? Why?
Now that the final element of the list is no longer a string (it has been set to the inte‐
ger 100), the execution should succeed when the continue command is entered. The
continue command, as you can see, proceeds with the execution until the program
ends. The actual file can now be edited to capture this bug fix. The script that calcu‐
lates the mean should now be similar to the following:
def mean(nums):
    top = sum(nums)
    bot = len(nums)
    return float(top) / float(bot)

if __name__ == "__main__":
    a_list = [1, 2, 3, 4, 5, 6, 10, 100]
    result = mean(a_list)
    print(result)
Sometimes, however, you may not be interested in running the execution all the way
to the end. There may be some other place in the execution where the state of the
variable should be checked. For this reason, the continue command stops if a break‐
point is reached. What is a breakpoint?
Breakpoints
If there is only one suspicious point in the execution, then setting the trace at that
point or shortly before it is sufficient. However, sometimes a variable should be
checked at many points in the execution—perhaps every time a loop is executed,
every time a certain function is entered, or right before as well as right after the vari‐
able should change values. In this case, breakpoints are set.
In pdb, we can set a breakpoint using the break or shorthand b syntax. We set it at a
certain line in the code by using the line number of that place in the code or the name
of the function to flag:
b(reak) ([file:]lineno | function)[, condition]
With breakpoints, new lines can be investigated as soon as they become suspicious.
Just set the breakpoint and call the continue function. The execution will continue
until pdb encounters the line at which you have set the breakpoint. It will then pause
execution at that point.
However, for this to work, you have to know where to put the breakpoint. In order to
know that, the developer often has to know the code execution path that led to the
error or crash. That list is called the backtrace, and it can be accessed from the pdb
debugger quite easily with the bt command, which outputs the stack of commands
that led up to the current state of the program. Sometimes also called the call stack,
execution stack, or traceback, it answers the question “How did we get here?”
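For example, a minimal sketch with the mean script (the path, line numbers, and
output shown here are illustrative):
(Pdb) b mean
Breakpoint 1 at /filespace/users/h/hopper/bugs/a_list_mean.py:3
(Pdb) c
> /filespace/users/h/hopper/bugs/a_list_mean.py(4)mean()
-> top = sum(nums)
(Pdb) bt
Setting the breakpoint by function name pauses execution every time mean() is
called, and bt then lists the chain of calls, here the module-level call to
mean(a_list), that led to the paused line.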
With that, you should have enough information to begin debugging your code. How‐
ever, the job is not done. Even when your code is no longer exhibiting actual errors,
there may still be issues that slow it down or are otherwise nonoptimal. To increase
the speed of your code, it is helpful to know which parts are the slowest. The next
section focuses on just how to find that out.
Profiling
Tools called profilers are used to sketch a profile of the time spent in each part of the
execution stack. Profiling goes hand in hand with the debugging process. When there
are suspected memory errors, profiling is the same as debugging. When there are
simply memory inefficiencies, profiling can be used for optimization.
For example, certain for loops may be the source of slowdowns in a piece of software.
Since we can often reduce for loops by vectorizing them, it is tempting to guess that
the best solution is to rewrite all for loops in this more complex manner. However,
that is a lower-level programming task that takes programmer effort. So, instead of
vectorizing all for loops, it is best to find out which ones are the slowest, and focus
on those.
In Python, cProfile is a common way to profile a piece of code. For our
fixed_mean.py file, in which the bugs have been fixed, cProfile can be executed on the
command line, as follows:
$ python -m cProfile -o output.prof fixed_mean.py
Give the output file a name. It typically ends in the prof extension.
Provide the name of the Python code file to be examined.
That creates a profile file in a binary format, which must be read by an interpreter of
such files. The next section will discuss such an interpreter.
Viewing the Profile with pstats
One fast option is to use the pstats module. In an interactive Python session, the
print_stats() function within the pstats package provides a breakdown of the time
spent in each major function:
In [1]: import pstats
In [2]: p = pstats.Stats('output.prof')
In [3]: p.print_stats()
Mon Dec  8 19:43:12 2014    output.prof

         5 function calls in 0.000 seconds

   Random listing order was used

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 fixed_mean.py:1(<module>)
        1    0.000    0.000    0.000    0.000 {sum}
        1    0.000    0.000    0.000    0.000 fixed_mean.py:1(mean)
        1    0.000    0.000    0.000    0.000 {method 'disable' of ...
        1    0.000    0.000    0.000    0.000 {len}
A summary of the run. print_stats doesn’t have very fine resolution.
The print_stats function prints the number of calls to each function, the total
time spent in each function, the time spent each time that function was called,
the cumulative time elapsed in the program, and the place in the file where the
call occurs.
This view is more helpful for programs that take longer to run. The many zeros in
this example indicate that the time per function was never higher than 0.0009 sec‐
onds. Since the fixed_mean.py script runs so quickly, pstats does not, by default,
print with fine enough resolution to capture the variable time spent in each function.
By using various configuration options, we can make pstats print with finer resolu‐
tion. That exercise is left up to the reader. A more effective way to view this informa‐
tion is with a graphical interface. We will move along to the next section to learn
more about that option.
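As a hint for that exercise, here is a minimal sketch (strip_dirs, sort_stats, and
print_stats are all standard pstats methods; the sort key and entry count are
arbitrary choices):
In [4]: p.strip_dirs()                # drop long directory prefixes from filenames
In [5]: p.sort_stats('cumulative')    # order entries by cumulative time
In [6]: p.print_stats(5)              # show only the five most expensive entries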
Viewing the Profile Graphically
Many more beautiful and detailed ways to view this output exist. One is a program
called RunSnakeRun.
RunSnakeRun is a common graphical interpreter for profiler output from cProfile
and the kernprof tool (which we’ll meet in the next section). With the simple com‐
mand runsnake <filename> on the command line, RunSnakeRun opens a GUI for
browsing the profile output. The results from our simple mean function are shown in
Figure 17-1. In RunSnakeRun, the total amount of colored area is the amount of time
spent in the program. Within that box, any calls to functions are shown by the
amount of time spent in them, hierarchically.
Figure 17-1. Profiling the mean function with RunSnakeRun
However, that example is not very exciting. For more complicated programs, the
results can be quite interesting, as seen in Figure 17-2.
Figure 17-2. Profiling a more complex script with RunSnakeRun
At the top is a percent button. That button will show a breakdown of the percentage
of time spent in each part of the code. This interactive graphic demonstrates the
behavior of each section of the code so that you can quickly see where time is being
wasted.
Another option, inspired by RunSnakeRun, is an in-browser viewer called SnakeViz.
To use SnakeViz, first make sure it is installed by running which snakeviz. If it is not present, try installing it with pip (pip install snakeviz), with your package manager, or by downloading it from its website. Next, on the command line, type:
$ snakeviz output.prof
The SnakeViz program will cause a web browser to open and will provide an interac‐
tive infographic of the data in output.prof. The results for our simple code are shown
in Figure 17-3.
Figure 17-3. Profiling with SnakeViz
With SnakeViz, the execution of the code can be browsed on a function-by-function
basis. The time spent in each function is rendered in radial blocks. The central circle
represents the top of the call stack—that is, the function from which all other func‐
tions are called. In our case, that is the main body of the module in the final four lines
of the file.
The next radial annulus describes the time spent in each function called by the main
function, and so on. When the mouse hovers over some section of the graph, more
information is shown. To learn more about SnakeViz and how to interpret its contents, see its website.
Combined with cProfile, these graphical interfaces for profiling are an efficient way
to pinpoint functions with efficiency issues. Sometimes, though, it is even more help‐
ful to know how much time you spend on each line. For this, consider kernprof.
Line Profiling with Kernprof
To show which specific lines are at fault for slowdowns, you can use a line profiler called kernprof. To use kernprof, you must alter the file itself by placing a decorator (@profile) above each function definition of interest. The mean code becomes:
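@profile
def mean(nums):
    top = sum(nums)
    bot = len(nums)
    return float(top)/float(bot)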
With that decorator in place, kernprof can then be run verbosely in line-by-line mode
thus:
$ kernprof -v -l fixed_mean.py
When kernprof is run in that way, the profile of time spent is printed to the terminal
in much greater detail than with the previous tools:
16.375
Wrote profile results to fixed_mean.py.lprof
Timer unit: 1e-06 s
Total time: 7e-06 s
File: fixed_mean.py
Function: mean at line 1
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           @profile
     2                                           def mean(nums):
     3         1            2      2.0     28.6      top = sum(nums)
     4         1            0      0.0      0.0      bot = len(nums)
     5         1            5      5.0     71.4      return float(top)/float(bot)
Since the code is run from start to finish, the code output is printed.
kernprof intelligently guesses the magnitude of time resolution to print.
The only profiled lines are those within the function that we decorated.
Each line has its own row in this table.
When you’re inspecting these results, the fifth column is the most important. It indi‐
cates the percentage of time spent on each line in the mean function. The results here
indicate that most of the time is spent calculating and returning the quotient. Perhaps
some speedup can be achieved. Can you think of any simplifications to the code? Try
making a change to determine whether it has an effect on the speed of execution.
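One possible simplification, offered here only as a sketch rather than a definitive answer, is to drop one of the float() conversions and the intermediate names:
@profile
def mean(nums):
    # a single float() conversion is enough to avoid integer division
    return sum(nums) / float(len(nums))
Whether this actually changes the timings is something kernprof can tell you.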
Now that our code no longer exhibits errors and can be optimized for speed, the only remaining debugging task is cleanup. A tool used to clean up code is called a linter.
Linting
Linting removes “lint” from source code. It’s a type of cleanup that is neither debug‐
ging nor testing nor profiling, but can be helpful at each of these stages of the pro‐
gramming process. Linting catches unnecessary imports, unused variables, potential
typos, inconsistent style, and other similar issues.
Linting in Python can be achieved with the pyflakes tool. Get it? Errors are more than
just lint, they’re flakes!
As an example of how to use a linter, recall the elementary.py file from Chapter 6. To
lint a Python program, execute the pyflakes command on it:
$ pyflakes elementary.py
pyflakes responds with a note indicating that a package has been imported but
remains unused throughout the code execution:
elementary.py:2: 'numpy' imported but unused
This information is more than just cosmetic. Since importing packages takes time
and occupies computer memory, reducing unused imports can speed up your code.
That said, most linting tools do focus on cosmetic issues. Style-related linting tools
such as flake8, pep8, or autopep8 can be used to check for errors, variable name mis‐
spelling, and PEP8 compatibility. For more on the PEP8 style standard in Python, see
Chapter 19. To use the pep8 tool, simply call it from the command line:
$ pep8 elementary.py
It will analyze the Python code that you have provided and will respond with a line-by-line listing of stylistic incompatibilities with the PEP8 standard:
elementary.py:4:1: E302 expected 2 blank lines, found 1
elementary.py:5:3: E111 indentation is not a multiple of four
elementary.py:7:31: E228 missing whitespace around modulo operator
This indicates that the elementary.py file has a few shortcomings with respect to the PEP8 Style Guide. The combined information of both tools can be retrieved with the much stricter pylint tool on the command line:
$ pylint -rn elementary.py
The -rn flag simply tells pylint not to print its full report. The report provided by
pylint by default is quite lengthy indeed and could easily occupy half of the pages in
this chapter:
No config file found, using default configuration
************* Module elementary
W: 5, 0: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
W: 6, 0: Bad indentation. Found 4 spaces, expected 8 (bad-indentation)
W: 7, 0: Bad indentation. Found 4 spaces, expected 8 (bad-indentation)
W: 8, 0: Bad indentation. Found 4 spaces, expected 8 (bad-indentation)
W: 9, 0: Bad indentation. Found 4 spaces, expected 8 (bad-indentation)
C: 1, 0: Missing module docstring (missing-docstring)
C: 6, 4: Invalid attribute name "s" (invalid-name)
C: 7, 4: Invalid attribute name "isFermion" (invalid-name)
C: 8, 4: Invalid attribute name "isBoson" (invalid-name)
C: 4, 0: Missing class docstring (missing-docstring)
W: 5, 2: __init__ method from base class 'Particle' is not called...
Once the incorrect indentation, invalid names, and missing docstrings are fixed, your
code will be ready for prime time.
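Many purely stylistic complaints can also be repaired automatically. As a sketch, the autopep8 tool mentioned above can rewrite a file in place to conform to PEP8 (check its documentation for the full set of options):
$ autopep8 --in-place elementary.py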
Debugging Wrap-up
Having read this chapter, you should feel ready to use an interactive debugger to
more efficiently and systematically:
• Understand bugs
• Track down their cause
• Prototype solutions
• Check for success
Additionally, this chapter should have prepared you to use profilers and linters to
optimize and clean your code once you’ve fixed the bugs. Now that you are prepared
to deal with bugs and inefficiencies that arise in your code, your focus can turn to
keeping them from appearing in the first place. In the next chapter, we will show you
how to avoid bugs with comprehensive, systematic testing.
CHAPTER 18
Testing
Before relying on a new experimental device, a good physicist will establish its accu‐
racy. A new detector will always have its responses to known input signals tested. The
results of this calibration are compared against the expected responses. If the device is
trustworthy, then the responses received will fall within acceptable bounds of what
was expected. To make this a fair test, the accuracy bounds are set prior to the test.
The same goes for testing in computational science and software development.
Code is assumed guilty until proven innocent. This applies to software written by
other people, but even more so to software written by yourself. The mechanism that
builds trust that software is performing correctly is called testing.
Testing is the process by which the expected results of code are compared against the
observed results of actually having run that code. Tests are typically provided along
with the code that they are testing. The collection of all of the tests for a given piece of
code is known as the test suite. You can think of the test suite as a bunch of precanned
experiments that anyone can run. If all of the tests pass, then the code is at least parti‐
ally trustworthy. If any of the tests fail, then the code is known to be incorrect with
respect to whichever case failed.
Now, you may have noticed that the test code itself is part of the software package.
Since the tests are just as likely to have bugs as the code they are testing, it is tempting
to start writing tests that test the tests. However, this quickly runs into an incomplete‐
ness problem. There is no set of tests that is the set of all possible tests. Suppose you
write a test suite for some code. Now your test suite is untested, so you add a test for
the test suite. Now your test suite tester is untested, so you write a test for that, and so
on. It is possible to escape this infinite-work trap using recursion, as discussed in
Chapter 5, but it probably is not worth your time.
Even one level of testing—just testing main code and not the tests themselves—is
incredibly beneficial. Almost all of the scientific value comes from this first pass. This
is because the first level is where the physics is put directly in question. A sufficiently
rigorous test suite will find all of the physical and computational errors without hav‐
ing to worry about the philosophical and mathematical ramifications of whether a
test is itself sufficiently tested.
Testing is so central to both the scientific method and modern software development
that many computational scientists consider it a moral failing for a scientific program
not to include tests. They also know to not trust a code when the tests do not pass.
Neither should you. For software that you do not write, it is always a good idea to run
the test suite when you first start working with the code. The documentation will typ‐
ically include instructions on how to run the tests, since they can be different from
project to project.
In this chapter, we will be discussing testing in the context of Python. Specifically, we
will be using the nose testing framework. This is a package of tools that make writing
and running tests easy. Though other test frameworks exist in Python (pytest, uni
ttest), nose has become the standard testing tool for scientific Python. It helps that it
is also easier to use and understand than some of the others.
We start this chapter by asking a series of questions that illuminate good testing prac‐
tices that everyone should follow.
Why Do We Test?
Testing is a great practice that aids all of software development. In most fields, however, failing to practice good habits is not considered a moral failing. Testing is considered a core principle of scientific software because its impact is at the heart of knowledge generation.
In most other programming endeavors, if code is fundamentally wrong, even if it
goes uncorrected for years at a time, the impact of this error can be relatively small.
Perhaps a website goes down, or a game crashes, or a day’s worth of writing is lost
when the computer crashes. Scientific code, on the other hand, controls planes, weap‐
ons systems, satellites, agriculture, and (most importantly) physics simulations and
experiments. If the software that governs a computational or physical experiment is
wrong, then any decisions that are made based on its results will be completely
untrustworthy.
This is not to say that physicists have a monopoly on software testing. Arguably, test‐
ing is just as important in arenas such as finance, government, and health care. Gross
failures in these areas, however, tend to affect lives and livelihoods rather than knowl‐
edge itself.
We would like to think that scientists are rigorous enough to realize the importance
of testing, but mistakes of negligence happen all too frequently. Everyone who has
been involved with scientific software for any length of time has a horror story or
two. The truth of the matter is that most scientists are poorly equipped to truly test
their code. The average blog or image-sharing website is better tested than most sci‐
entific software.
This chapter is here to help remedy the poor testing situation by explaining the moti‐
vation behind testing and giving you the tools you need to do better.
When Should We Test?
Always.
Testing should be a seamless part of the scientific software development process. Tests
should be created along with the code that is to be tested. At the very minimum, at
least one test should be written immediately following the initial implementation of a
function or a class. At the beginning of a new project, tests can be used to help guide
the overall architecture of the project. This is analogous to experiment design in the
experimental science world. The act of writing tests can help clarify how the software
should be performing. Taking this idea to the extreme, you could start to write the
tests before you even write the software that will be tested. We will discuss this prac‐
tice in greater detail in “Test-Driven Development” on page 419.
In Working Effectively with Legacy Code (Prentice Hall), Michael Feathers defines leg‐
acy code as “any code without tests.” This definition draws on the fact that after its
initial creation, tests provide a powerful guide to other developers (and to your for‐
getful self, a few months in the future) about how each function in a piece of code is
meant to be used. Without runnable tests to provide examples of code use, even
brand new programs are unsustainable.
Where Should We Write Tests?
While writing code, you can add exceptions and assertions to sound an alarm as run‐
time problems come up. These kinds of tests, however, are embedded in the software
itself. It is better to separate the code from the tests as much as possible. External tests
require a bit more sophistication, but are better suited to checking the implementa‐
tion against its expected behavior. These external tests are what is referred to as the
test suite. The runtime exceptions and assertions do not count as part of the test suite.
Many projects choose to have a top-level directory named after the project or called
src/. Similarly, many projects also have a top-level tests/ directory where the test suite
lives. This often mirrors the directory structure of the source directory. Mirroring
makes it obvious where the test lives for any corresponding piece of source code.
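As a sketch, a mirrored layout for a hypothetical project named compphys (all of the file and directory names here are illustrative) might look like:
compphys/
    compphys/
        __init__.py
        constants.py
        physics.py
    tests/
        test_constants.py
        test_physics.py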
Alternatively, some projects choose to place the tests right next to the source code
that they are testing. Say you had a module called physics.py. In this schema, the tests
would live in a test_physics.py file to keep them somewhat separate. This strategy is
not recommended, though you will sometimes encounter it in the wild.
As with everything in software, the most important aspect of where to put the tests is
to be consistent. Choose one approach and follow that for all of the tests that you write
in a given project. If you are working on a more established project, be sure to con‐
form to whatever pattern was set before you started.
What and How to Test?
Consider again the analogy to a detector in a physical experiment. The behavior of
the detector must be characterized and calibrated across the valid range of interest.
However, it is often unnecessary to characterize the response to every possible valid
input. Most detectors rely on the physical quantity that they measure being either
continuous or discrete. Testing only a few key signals, typically at the upper and lower
edges of its range and some points in between, is enough to determine if and how
well the machine is working. This “test what is important” mindset applies equally to
scientific software development. Software tests should cover behavior from the com‐
mon to the extreme, but not every single value within those bounds.
Let’s see how this mindset applies to an actual physics problem. Given two previous
observations in the sky and the time between them, Kepler's Laws provide a closed-form equation for the future location of a celestial body. This can be implemented via
a function named kepler_loc(). The following is a stub interface representing this
function that lacks the actual function body:
def kepler_loc(p1, p2, dt, t):
    ...
    return p3
As a basic test of this function, we can take three points on the planet Jupiter’s actual
measured path and use the latest of these as the expected result. We will then compare
this to the result that we observe as the output of the kepler_loc() function.
Tests compare expected outputs versus observed outputs for
known inputs. They do not inspect the body of the function
directly. In fact, the body of a function does not even have to exist
for a valid test to be written.
To start testing, we will raise an exception as a way of signaling that the test failed if
the expected value is not equal to the observed value. Frequently, tests are written as
functions that have the same name as the code that they are testing with the word
test either before or after it. The following example is pseudocode for testing that the
measured positions of Jupiter, given by the function jupiter(), can be predicted with
the kepler_loc() function:
def test_kepler_loc():
    p1 = jupiter(two_days_ago)
    p2 = jupiter(yesterday)
    exp = jupiter(today)
    obs = kepler_loc(p1, p2, 1, 1)
    if exp != obs:
        raise ValueError("Jupiter is not where it should be!")
The test_kepler_loc() function tests kepler_loc().
Get the inputs to kepler_loc().
Obtain the expected result from experimental data.
Obtain the observed result by calling kepler_loc().
Test that the expected result is the same as the observed result. If it is not, signal
that the test failed by raising an exception.
Now, calling the test_kepler_loc() function will determine whether kepler_loc()
is behaving as intended. If a ValueError is raised, then we know something is wrong.
The test_kepler_loc() function follows a very common testing pattern:
1. Name the test after the code that is being tested.
2. Load or set the expected results.
3. Compute the observed result by actually running the code.
4. Compare the expected and observed results to ensure that they are equivalent.
This pattern can be boiled down to the following pseudocode:
def test_func():
    exp = get_expected()
    obs = func(*args, **kwargs)
    assert exp == obs
It is critical to understand that tests should usually check for equivalence (==) and not identity (is). It is more important that the expected and observed results are effec‐
tively the same than that they are actually the same exact object in memory. For the
floating-point data that is common in physics, it is often more pertinent for the
expected and observed results to be approximately equal than it is for them to have
precisely the same value. Floats are an approximation, and this needs to be accounted
for when you’re testing.
Testing equivalence via exceptions is rather like hammering a nail with a grenade.
The nail will probably go in (the test will run), but the grenade will take everything
else (i.e., the Python interpreter) along with it. A slightly more subtle way to accom‐
plish the same task would be to use assertions. From Table 2-1, recall that an assert
statement in Python ensures that the expression following it evaluates to True. If the
assertion is true, then Python continues on its merry way. If the assertion is false, then
an AssertionError is raised. We could rewrite test_kepler_loc() as follows:
def test_kepler_loc():
    p1 = jupiter(two_days_ago)
    p2 = jupiter(yesterday)
    exp = jupiter(today)
    obs = kepler_loc(p1, p2, 1, 1)
    assert exp == obs
Now with an assertion instead of an exception.
The assertion approach still lacks subtlety, though: all that we know when the test fails is that it failed. We do not see the values of the expected and observed results to help
us determine where the fault lies. To get this kind of extra information in the event of
a failure, we need to supply a custom assertion. Rich and descriptive assertions are
exactly what a test framework like nose provides.
nose has a variety of helpful and specific assertion functions that display extra debug‐
ging information when they fail. These are all accessible through the nose.tools
module. The simplest one is named assert_equal(). It takes two arguments, the
expected and observed results, and checks them for equivalence (==). We can further
rewrite test_kepler_loc() as seen here:
from nose.tools import assert_equal

def test_kepler_loc():
    p1 = jupiter(two_days_ago)
    p2 = jupiter(yesterday)
    exp = jupiter(today)
    obs = kepler_loc(p1, p2, 1, 1)
    assert_equal(exp, obs)
To obtain functionality from nose, first we have to import it.
Python’s assertion can be replaced with nose’s.
Using the test framework is the best way to write tests. Executing each of your tests by
hand, however, becomes tiresome when you have more than a handful in your test
suite. The next section goes over how to manage all of the tests you have written.
Running Tests
The major boon a testing framework provides is a utility to find and run the tests
automatically. With nose, this is a command-line tool called nosetests. When nosetests
is run, it will search all the directories whose names start or end with the word test,
find all of the Python modules in these directories whose names start or end with test,
import them, and run all of the functions and classes whose names start or end with
test. In fact, nose looks for any names that match the regular expression (?:^|[\\b_\
\.-])[Tt]est. This automatic registration of test code saves tons of human time and
allows us to focus on what is important: writing more tests.
When you run nosetests, it will print a dot (.) on the screen for every test that passes,
an F for every test that fails, and an E for every test where there was an unexpected
error. In rarer situations you may also see an S indicating a skipped test (because the
test is not applicable on your system) or a K for a known failure (because the develop‐
ers could not fix it promptly). After the dots, nosetests will print summary informa‐
tion. Given just the one test_kepler_loc() test from the previous section, nosetests
would produce results like the following:
$ nosetests
.
Ran 1 test in 0.224s
OK
As we write more code, we would write more tests, and nosetests would produce more
dots. Each passing test is a small, satisfying reward for having written quality scien‐
tific software. Now that you know how to write tests, let’s go into what can go wrong.
Edge Cases
What we saw in “What and How to Test?” on page 406 is called an interior test. The
precise points that we tested did not matter. Any two initial points in an orbit could
have been used to predict further positions. Though this is not as true for cyclic prob‐
lems, more linear scenarios tend to have a clear beginning, middle, and end. The out‐
put is defined on a valid range.
The situation where the test examines either the beginning or the end of a range, but
not the middle, is called an edge case. In a simple, one-dimensional problem, the two
edge cases should always be tested along with at least one internal point. This ensures
that you have good coverage over the range of values.
Anecdotally, it is important to test edge cases because this is where errors tend to
arise. Qualitatively different behavior happens at boundaries. As such, they tend to
have special code dedicated to them in the implementation. Consider the following
simple Fibonacci function:
def fib(n):
    if n == 0 or n == 1:
        return 1
    else:
        return fib(n - 1) + fib(n - 2)
This function has two edge cases: zero and one. For these values of n, the fib() func‐
tion does something special that does not apply to any other values. Such cases
should be tested explicitly. A minimally sufficient test suite for this function would
be:
from nose.tools import assert_equal
from mod import fib

def test_fib0():
    # test edge 0
    obs = fib(0)
    assert_equal(1, obs)

def test_fib1():
    # test edge 1
    obs = fib(1)
    assert_equal(1, obs)

def test_fib6():
    # test regular point
    obs = fib(6)
    assert_equal(13, obs)
Test the edge case for zero.
Test the edge case for one.
Test an internal point.
Different functions will have different edge cases. Often, you need not test for cases
that are outside the valid range, unless you want to test that the function fails. In the
fib() function negative and noninteger values are not valid inputs. You do not need
to have tests for these classes of numbers, though it would not hurt. Edge cases are
not where the story ends, though, as we will see next.
Corner Cases
When two or more edge cases are combined, it is called a corner case. If a function is
parametrized by two independent variables, a test that is at the extreme of both vari‐
ables is in a corner. As a demonstration, consider the case of the function (sin(x) /
x) * (sin(y) / y), presented here:
import numpy as np

def sinc2d(x, y):
    if x == 0.0 and y == 0.0:
        return 1.0
    elif x == 0.0:
        return np.sin(y) / y
    elif y == 0.0:
        return np.sin(x) / x
    else:
        return (np.sin(x) / x) * (np.sin(y) / y)
The function sin(x)/x is called the sinc() function. We know that at the point x = 0, sinc(x) == 1.0. In the code just shown, sinc2d() is a two-dimensional version of this function. When both x and y are zero, it is a corner case
because it requires a special value for both variables. If either x or y (but not both) is
zero, these are edge cases. If neither is zero, this is a regular internal point.
A minimal test suite for this function would include a separate test for the corner
case, each of the edge cases, and an internal point. For example:
import numpy as np
from nose.tools import assert_equal
from mod import sinc2d

def test_internal():
    exp = (2.0 / np.pi) * (-2.0 / (3.0 * np.pi))
    obs = sinc2d(np.pi / 2.0, 3.0 * np.pi / 2.0)
    assert_equal(exp, obs)

def test_edge_x():
    exp = (-2.0 / (3.0 * np.pi))
    obs = sinc2d(0.0, 3.0 * np.pi / 2.0)
    assert_equal(exp, obs)

def test_edge_y():
    exp = (2.0 / np.pi)
    obs = sinc2d(np.pi / 2.0, 0.0)
    assert_equal(exp, obs)

def test_corner():
    exp = 1.0
    obs = sinc2d(0.0, 0.0)
    assert_equal(exp, obs)
Test an internal point.
Test an edge case for x and internal for y.
Test an edge case for y and internal for x.
Test the corner case.
Corner cases can be even trickier to find and debug than edge cases because of their
increased complexity. This complexity, however, makes them even more important to
explicitly test.
Whether internal, edge, or corner cases, we have started to build up a classification
system for the tests themselves. In the following sections, we will build this system up
even more based on the role that the tests have in the software architecture.
Unit Tests
All of the tests that we have seen so far have been unit tests. They are so called
because they exercise the functionality of the code by interrogating individual func‐
tions and methods. Functions and methods can often be considered the atomic units
of software because they are indivisible from the outside.
However, what is considered to be the smallest code unit is subjective. The body of a
function can be long or short, and shorter functions are arguably more unit-like than
long ones. Thus, what reasonably constitutes a code unit typically varies from project
to project and language to language. A good rule of thumb is that if the code cannot
be made any simpler logically (you cannot split apart the addition operator) or practi‐
cally (a function is self-contained and well defined), then it is a unit. The purpose
behind unit tests is to encourage both the code and the tests to be as small, well-defined, and modular as possible. There is no one right answer for what this means,
though. In Python, unit tests typically take the form of test functions that are auto‐
matically called by the test framework.
Additionally, unit tests may have test fixtures. A fixture is anything that may be added
to the test that creates or removes the environment required by the test to successfully
run. They are not part of the expected result, the observed result, or the assertion. Test
fixtures are completely optional.
A fixture that is executed before the test to prepare the environment is called a setup
function. One that is executed to mop up side effects after a test is run is called a
teardown function. nose has a decorator that you can use to automatically run fix‐
tures no matter whether the test succeeded, failed, or had an error. (For a refresher on
decorators, see “Decorators” on page 112.)
Consider the following example that could arise when communicating with third-party programs. You have a function f() that will write a file named yes.txt to disk
with the value 42 but only if a file no.txt does not exist. To truly test that the function
works, you would want to ensure that neither yes.txt nor no.txt existed before you ran
your test. After the test, you would want to clean up after yourself before the next test
comes along. You could write the test, setup, and teardown functions as follows:
import os
from nose.tools import assert_equal, with_setup
from mod import f

def f_setup():
    files = os.listdir('.')
    if 'no.txt' in files:
        os.remove('no.txt')
    if 'yes.txt' in files:
        os.remove('yes.txt')

def f_teardown():
    files = os.listdir('.')
    if 'yes.txt' in files:
        os.remove('yes.txt')

def test_f():
    f_setup()
    exp = 42
    f()
    with open('yes.txt', 'r') as fhandle:
        obs = int(fhandle.read())
    assert_equal(exp, obs)
    f_teardown()
The f_setup() function ensures that neither the yes.txt nor the no.txt file exists.
The f_teardown() function removes the yes.txt file, if it was created.
The first action of test_f() is to make sure the filesystem is clean.
The last action of test_f() is to clean up after itself.
This implementation of test fixtures is usually fine. However, it does not guarantee
that the f_setup() and f_teardown() functions will be called. This is because an
unexpected error anywhere in the body of f() or test_f() will cause the test to abort
before the teardown function is reached. To make sure that both of the fixtures will be
executed, you must use nose’s with_setup() decorator. This decorator may be
applied to any test and takes a setup and a teardown function as possible arguments.
We can rewrite test_f() to be wrapped by with_setup(), as follows:
@with_setup(setup=f_setup, teardown=f_teardown)
def test_f():
    exp = 42
    f()
    with open('yes.txt', 'r') as fhandle:
        obs = int(fhandle.read())
    assert_equal(exp, obs)
Note that if you have functions in your test module that are simply named setup()
and teardown(), each of these is called automatically when the entire test module is
loaded in and finished.
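A minimal sketch of such module-level fixtures might look like the following (the data.txt file is purely illustrative):
import os

def setup():
    # run once, before any test in this module
    with open('data.txt', 'w') as fhandle:
        fhandle.write('42\n')

def teardown():
    # run once, after all tests in this module have finished
    if os.path.exists('data.txt'):
        os.remove('data.txt')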
Simple tests are the easiest to write. For this reason, functions
should be small enough that they are easy to test. For more infor‐
mation on writing code that facilitates tests, we recommend Robert
C. Martin’s book Clean Code (Prentice Hall).
Having introduced the concept of unit tests, we can now go up a level in complexity.
Integration Tests
You can think of a software project like a clock. Functions and classes are the gears
and cogs that make up the system. On their own, they can be of the highest quality.
Unit tests verify that each gear is well made. However, the clock still needs to be put
together. The gears need to fit with one another.
Integration tests are the class of tests that verify that multiple moving pieces of the
code work well together. They ensure that the clock can tell time correctly. They look
at the system as a whole or at subsystems. Integration tests typically function at a
higher level conceptually than unit tests. Thus, programming integration tests also
happens at a higher level.
Because they deal with gluing code together, there are typically fewer integration tests
in a test suite than there are unit tests. However, integration tests are no less impor‐
tant. Integration tests are essential for having adequate testing. They encompass all of
the cases that you cannot hit through plain unit testing.
Sometimes, especially in probabilistic or stochastic codes, the precise behavior of an
integration test cannot be determined beforehand. That is OK. In these situations it is
acceptable for integration tests to verify average or aggregate behavior rather than
exact values. Sometimes you can mitigate nondeterminism by saving seed values to a
random number generator, but this is not always going to be possible. It is better to
have an imperfect integration test than no integration test at all.
As a simple example, consider the three functions a(), b(), and c(). The a() func‐
tion adds one to a number, b() multiplies a number by two, and c() composes them.
These functions are defined as follows:
def a(x):
    return x + 1

def b(x):
    return 2 * x

def c(x):
    return b(a(x))
The a() and b() functions can each be unit-tested because they each do one thing.
However, c() cannot be truly unit tested because all of the real work is farmed out to
a() and b(). Testing c() will be a test of whether a() and b() can be integrated
together.
Integration tests still follow the pattern of comparing expected results to observed
results. A sample test_c() is implemented here:
from nose.tools import assert_equal
from mod import c

def test_c():
    exp = 6
    obs = c(2)
    assert_equal(exp, obs)
Given the lack of clarity in what is defined as a code unit, what is considered an inte‐
gration test is also a little fuzzy. Integration tests can range from the extremely simple
(like the one just shown) to the very complex. A good delimiter, though, is in opposi‐
tion to the unit tests. If a function or class only combines two or more unit-tested
pieces of code, then you need an integration test. If a function implements new
behavior that is not otherwise tested, you need a unit test.
The structure of integration tests is very similar to that of unit tests. There is an
expected result, which is compared against the observed value. However, what goes into creating the expected result or setting up the code to run can be considerably more
complicated and more involved. Integration tests can also take much longer to run
because of how much more work they do. This is a useful classification to keep in
mind while writing tests. It helps separate out which tests should be easy to write
(unit) and which ones may require more careful consideration (integration).
Integration tests, however, are not the end of the story.
Regression Tests
Regression tests are qualitatively different from both unit and integration tests.
Rather than assuming that the test author knows what the expected result should be,
regression tests look to the past. The expected result is taken as what was previously
computed for the same inputs. Regression tests assume that the past is “correct.” They
are great for letting developers know when and how a code base has changed. They
are not great for letting anyone know why the change occurred. The change between
what a code produces now and what it computed before is called a regression.
Like integration tests, regression tests tend to be high level. They often operate on an
entire code base. They are particularly common and useful for physics simulators.
A common regression test strategy spans multiple code versions. Suppose there is an
input file for version X of a simulator. We can run the simulation and then store the
output file for later use, typically somewhere accessible online. While version Y is
being developed, the test suite will automatically download the output for version X,
run the same input file for version Y, and then compare the two output files. If any‐
thing is significantly different between them, the test fails.
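As a sketch of this strategy, a regression test for a hypothetical simulator might look like the following; run_simulation(), the mysim module, and both file names are invented for illustration:
import numpy as np
from nose.tools import assert_true

from mysim import run_simulation

def test_input42_regression():
    # reference output saved from the previous, trusted version of the code
    exp = np.loadtxt('reference/input42_output.txt')
    # output from the version currently under development
    obs = run_simulation('input42.txt')
    # "significantly different" is defined here by a relative tolerance
    assert_true(np.allclose(exp, obs, rtol=1e-8))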
In the event of a regression test failure, the onus is on the current developers to
explain why. Sometimes there are backward-incompatible changes that had to be
made. The regression test failure is thus justified, and a new version of the output file
should be uploaded as the version to test against. However, if the test fails because the
physics is wrong, then the developer should fix the latest version of the code as soon
as possible.
Regression tests can and do catch failures that integration and unit tests miss. Regres‐
sion tests act as an automated short-term memory for a project. Unfortunately, each
project will have a slightly different approach to regression testing based on the needs
of the software. Testing frameworks provide tools to help with building regression
tests but do not offer any sophistication beyond what has already been seen in this
chapter.
Depending on the kind of project, regression tests may or may not be needed. They
are only truly needed if the project is a simulator. Having a suite of regression tests
that cover the range of physical possibilities is vital to ensuring that the simulator still
works. In most other cases, you can get away with only having unit and integration
tests.
While more test classifications exist for more specialized situations, we have covered
what you will need to know for almost every situation in computational physics. In
the following sections, we will go over how to write tests more effectively.
Test Generators
Test generators automate the creation of tests. Suppose that along with the function
you wished to test, you also had a long list of expected results and the associated
arguments to the function for those expected results. Rather than you manually creat‐
ing a test for each element of the list, the test generator would take the list and manu‐
facture the desired tests. This requires much less work on your part while also
providing more thorough testing. The list of expected results and function inputs is
sometimes called a test matrix.
In nose, test generators are written by turning the test function into a generator with
yield statements.1 In the test function, the assertion for each element of the matrix is
yielded, along with the expected value and the function inputs. Corresponding check
functions sometimes go along with the test generator to perform the actual work.
For demonstration purposes, take a simple function that adds two numbers together.
The function, the check function, and the test generator could all be written as
follows:
from nose.tools import assert_equal

def add2(x, y):
    return x + y

def check_add2(exp, x, y):
    obs = add2(x, y)
    assert_equal(exp, obs)

def test_add2():
    cases = [
        (4, 2, 2),
        (5, -5, 10),
        (42, 40, 2),
        (16, 3, 13),
        (-128, 0, -128),
        ]
    for exp, x, y in cases:
        yield check_add2, exp, x, y
The function to test, add2().
The check function performs the equality assertion instead of the test.
The test function is now a test generator.
1 See “Generators” on page 109 for a refresher on generators if you need one.
cases is a list of tuples that represents the test matrix. The first element of each
tuple is the expected result. The following elements are the arguments to add2().
Looping through the test matrix cases, we yield the check function, the
expected value, and the add2() arguments. Nose will count each yield as a sepa‐
rate full test.
This will produce five tests in nose, one for each case. We can therefore efficiently
create many tests and minimize the redundant code we need to write. Running
nosetests will produce the following output:
$ nosetests
.....
Ran 5 tests in 0.001s
OK
This is a very powerful testing mechanism because adding or removing tests is as easy
as modifying the cases list. Different testing frameworks implement this idea in dif‐
ferent ways. In all frameworks, it makes your life easier. Generating many test cases
will hopefully cover more of the code base. The next section will discuss how to
determine how many lines of your project are actually being executed by the test
suite.
Test Coverage
The term test coverage is often used to mean the percentage of the code for which an
associated test exists. You can measure this by running the test suite and counting the
number of lines of code that were executed and dividing this by the total number of
lines in the software project. If you have the coverage Python project installed (pip
install coverage), you can run nose and generate coverage statistics simultaneously
via the --with-coverage switch at the command line:
$ nosetests --with-coverage
At first glance this metric seems like a useful indicator of code reliability. But while
some test coverage is superior to none and broad test coverage is usually superior to
narrow coverage, this metric should be viewed critically. All code should ideally have
100% test coverage, but this alone does not guarantee that the code works as
intended. Take the following pseudocode for a function g() shown here, with two if-else statements in its body:
def g(x, y):
    if x:
        ...
    else:
        ...
    if y:
        ...
    else:
        ...
    return ...
The following two unit tests for g() have 100% coverage:
from nose.tools import assert_equal
from mod import g

def test_g_both_true():
    exp = ...
    obs = g(True, True)
    assert_equal(exp, obs)

def test_g_both_false():
    exp = ...
    obs = g(False, False)
    assert_equal(exp, obs)
Every line of g() is executed by these two functions. However, only half of the possi‐
ble cases are covered. We are not testing when x=True and y=False or when x=False
and y=True. In this case, 100% coverage is only 50% of the possible code path combi‐
nations. In full software projects, 100% coverage is often achieved while much less than 50% of the possible code paths are executed.
Code coverage is an important and often cited measure. However, it is not the
pinnacle of testing. It is another tool in your testing toolbox. Use it as needed and
understand its limitations.
The next section covers another tool, but one that changes the testing strategy itself.
Test-Driven Development
Test-driven development (TDD) takes the workflow of writing code and writing tests
and turns it on its head. TDD is a software development process where you write the
tests first. Before you write a single line of a function, you first write the test for that
function.
After you write a test, you are then allowed to proceed to write the function that you
are testing. However, you are only supposed to implement enough of the function so
that the test passes. If the function does not do what is needed, you write another test
and then go back and modify the function. You repeat this process of test-thenimplement until the function is completely implemented for your current needs.
Developers who practice strict TDD will tell you that it is the best thing since sliced
arrays. The central claim to TDD is that at the end of the process you have an imple‐
mentation that is well tested for your use case, and the process itself is more efficient.
You stop when your tests pass and you do not need any more features. You do not
spend any time implementing options and features on the off chance that they will
prove helpful later. You get what you need when you need it, and no more. TDD is a
very powerful idea, though it can be hard to follow religiously.
The most important takeaway from test-driven development is that the moment you
start writing code, you should be considering how to test that code. The tests should
be written and presented in tandem with the implementation. Testing is too impor‐
tant to be an afterthought.
Whether to pursue classic TDD is a personal decision. This design philosophy was
most strongly put forth by Kent Beck in his book Test-Driven Development: By Exam‐
ple. The following example illustrates TDD for a standard deviation function, std().
To start, we write a test for computing the standard deviation from a list of numbers
as follows:
from nose.tools import assert_equal
from mod import std

def test_std1():
    obs = std([0.0, 2.0])
    exp = 1.0
    assert_equal(obs, exp)
Next, we write the minimal version of std() that will cause test_std1() to pass:
def std(vals):
    # surely this is cheating...
    return 1.0
As you can see, the minimal version simply returns the expected result for the sole
case that we are testing. If we only ever want to take the standard deviation of the
numbers 0.0 and 2.0, or 1.0 and 3.0, and so on, then this implementation will work
perfectly. If we want to branch out, then we probably need to write more robust code.
However, before we can write more code, we first need to add another test or two:
def test_std1():
    obs = std([0.0, 2.0])
    exp = 1.0
    assert_equal(obs, exp)

def test_std2():
    obs = std([])
    exp = 0.0
    assert_equal(obs, exp)
def test_std3():
    obs = std([0.0, 4.0])
    exp = 2.0
    assert_equal(obs, exp)
Test the fiducial case when we pass in an empty list.
Test a real case where the answer is not one.
A perfectly valid standard deviation function that would correspond to these three
tests passing would be as follows:
def std(vals):
    # a little better
    if len(vals) == 0:
        return 0.0
    return vals[-1] / 2.0
Special case the empty list.
By being clever, we can get away without doing real work.
Even though the tests all pass, this is clearly still not a generic standard deviation
function. To create a better implementation, TDD states that we again need to expand
the test suite:
def test_std1():
    obs = std([0.0, 2.0])
    exp = 1.0
    assert_equal(obs, exp)

def test_std2():
    obs = std([])
    exp = 0.0
    assert_equal(obs, exp)

def test_std3():
    obs = std([0.0, 4.0])
    exp = 2.0
    assert_equal(obs, exp)

def test_std4():
    obs = std([1.0, 3.0])
    exp = 1.0
    assert_equal(obs, exp)

def test_std5():
    obs = std([1.0, 1.0, 1.0])
    exp = 0.0
    assert_equal(obs, exp)
The first value is not zero.
Here, we have more than two values, but all of the values are the same.
At this point, we may as well try to implement a generic standard deviation function.
We would spend more time trying to come up with clever approximations to the
standard deviation than we would spend actually coding it. Just biting the bullet, we
might write the following implementation:
def std(vals):
    # finally, some math
    n = len(vals)
    if n == 0:
        return 0.0
    mu = sum(vals) / n
    var = 0.0
    for val in vals:
        var = var + (val - mu)**2
    return (var / n)**0.5
It is important to note that we could improve this function by writing further tests.
For example, this std() ignores the situation where infinity is an element of the val‐
ues list. There is always more that can be tested. TDD prevents you from going over‐
board by telling you to stop testing when you have achieved all of your use cases.
Testing Wrap-up
Testing is one of the primary concerns of scientific software developers. It is a techni‐
cal solution to a philosophical problem. You should now be familiar with the follow‐
ing concepts in testing:
• Tests compare the result observed from running code against the result that was expected ahead of time.
• Tests should be written at the same time as the code they are testing is written.
• The person best suited to write a test is the author of the original code.
• Tests are grouped together in a test suite.
• Test frameworks, like nose, discover and execute tests for you automatically.
• An edge case is when an input is at the limit of its range.
• A corner case is where two or more edge cases meet.
• Unit tests try to test the smallest pieces of code possible, usually functions and
methods.
• Integration tests make sure that code units work together properly.
• Regression tests ensure that everything works the same today as it did yesterday.
• Test generators can be used to efficiently check many cases.
• Test coverage is the percentage of the code base that is executed by the test suite.
• Test-driven development says to write your tests before you write the code that is
being tested.
You should now know how to write software and how to follow the best practices that
make software both useful and great. In the following chapters we will go over how
you can let the world know about the wonderful things that you have done.
PART IV
Getting It Out There
CHAPTER 19
Documentation
Computational science is a special case of scientific research: the work is easily shared
via the Internet since the paper, code, and data are digital and those three aspects are
all that is required to reproduce the results, given sufficient computation tools.
—Victoria Stodden, “The Scientific
Method in Practice: Reproducibility in
the Computational Sciences”
Scientists are nomads. As students, they contribute to a piece of research for no more
than four years at a time. As post-docs, their half-life on a project is even shorter.
They disappear after three years, maximum. Even once they settle down as faculty or
laboratory scientists, their workforce is composed primarily of these fly-by-night
individuals. As such, research work in laboratories and universities occurs on a time
scale rarely longer than the tenure of a typical PhD student.
In this environment, it is very common for scientists to crank out a piece of code as
quickly as possible, squeeze a few publications out of it, and disappear to lands
unknown. One victim in all of this is the student or researcher that follows them,
seeking to extend their work. Since the first scientist working on a project valued
speed over sustainability, the second researcher inherits a piece of code with no docu‐
mentation. Accordingly, the original work, often termed “legacy code,” seems to be
understood only by its author. The new contributors to such projects often think to
themselves that rewriting the code from scratch would be easier than deciphering the
enigmas before them. The cycle, of course, repeats itself.
Why Prioritize Documentation?
Chronic inefficiency permeates this situation, fundamentally disrupting the forward
progression of science. In her paper “Better Software, Better Research,” Professor
Carole Goble relates a favorite tweet on the topic:
One of my favorite #overlyhonestmethods tweets (a hashtag for lab
scientists) is Ian Holmes’s “You can download our code from the
URL supplied. Good luck downloading the only postdoc who can
get it to run, though.”
Though the original tweet was intended as satire, it’s almost too true to be funny. The
status quo needs to change. Thankfully, there is hope. The whole culture of science
does not adhere to this unfortunate state of affairs out of ill will or malice. It’s all a
simple misunderstanding—namely, that “Documentation is not worth the time it
takes.”
This chapter will explain why this statement is so wrong.
Documentation Is Very Valuable
The first false premise behind this statement is that documentation is not valuable.
The truth is that documentation is valuable enough to be a top priority, almost irre‐
spective of how much time it takes to generate it. Its value is paramount because:
• The value and extent of your work is clearer if it can be understood by colleagues.
• Documentation provides provenance for your scientific process, for your collea‐
gues and yourself.
• Documentation demonstrates your skill and professionalism.
Other people will interact with your code primarily through its documentation. This
is where you communicate the value and intent of your research work. However, the
documentation serves as more than an advertisement to your colleagues. It guides the
interest of those who might desire to comment on your work or collaborate with you
on extensions to it. Somewhat cynically, in this way documentation is superior to
modern archival publications, which rarely contain enough detail to fully reproduce
work. Rather, they provide enough information to allow informed critique of the
methods and serve, frankly, to publicize your efforts as a scientist.
In a similar vein, documentation provides provenance for your scientific procedure.
That is, documentation is worthwhile because it preserves a record of your thought
process. This becomes indispensable as time passes and you inevitably forget how
your code works—just in time for a journal editor to ask about the features of the
results. Rather than having to frantically reread the code in the hopes of stumbling
upon its secrets, you’ll have the documentation there to remind you of the equations
you were implementing, the links to the journal articles that influenced your algo‐
rithm, and everything else that would, were this bench science, certainly be recorded
in a laboratory notebook.
Documentation also acts as a demonstration of your skill and professionalism. Stating
you have a piece of code is one thing, but without documentation, it will be difficult
to demonstrate that this code is a professionally developed, polished piece of work
that can be used by others. Furthermore, since most scientists labor under the false
assumption that documentation is difficult and time-consuming to write, they will be
all the more impressed with your efficiency.
Of course, they’re wrong. Documentation is relatively easy; it can even be automated
in many cases.
Documentation Is Easier Than You Think
The second false premise behind the idea that documentation isn’t worth the effort is
that writing documentation takes a lot of time. This is wrong for two reasons:
• Documentation pays for itself with the time it saves in the long run.
• Documentation requires little effort beyond writing the software itself.
Any time you spend on documentation will pay for itself with the time it will save in
the long run. New users need either documentation or hand-holding, but hand-holding does not scale. Documentation, on the other hand, scales majestically. Funda‐
mentally, if something is written down, it will never need to be explained again. All
questions about how the software works can now be redirected to the user manual.
Your brain, then, remains free for something else. Well-documented code is some‐
what self-maintaining, because when someone new comes along to use your code, the
documentation does the work of guiding them so you don’t have to.
Even disregarding future time savings, producing documentation takes little effort
beyond writing the software itself. Documentation can be easily streamlined into the
programming workflow so that updates aren’t a separate task. For every modern pro‐
gramming language, there is a framework for automatically generating a user manual
based on well-formed comments in the source code (see “Automation” on page 436).
These frameworks minimize the effort on the part of the developer and help to
ensure that the documentation is always up to date, since it is version controlled right
alongside the code. Additionally, the necessity for comments can be reduced with use
of standardized style guides, descriptive variable naming, and concise functions.
Types of Documentation
Documentation comes in myriad forms. Each has its own purpose, benefits, and
drawbacks. A single project may have all, some, or none of the following types of doc‐
umentation. Ideally, they all work together or at least exhibit some separation of con‐
cerns. Types of documentation often encountered in research software include:
• Theory manuals
• User and developer guides
• Code comments
• Self-documenting code
• Generated API documentation
We’ll look at each of these, beginning with the one academics are typically most
familiar with: the theory manual.
Theory Manuals
In the universe of academic and research science, the theory manual most often takes
the form of a dissertation describing the theoretical foundations of the code base that
existed on the day of the defense. Depending on the origin of the code and the career
stage of the lead developer, the theory manual can also take the form of a series of
white papers, journal articles, or internal reports. Whatever the case may be, a theory
manual has a number of distinctly helpful qualities:
• It captures the scientific goals and provenance of the code.
• It has been peer-reviewed.
• It is archived.
• It can be cited.
However, theory manuals have disadvantages as well. Typically:
• They represent significant effort.
• They are not living documents.
• They do not describe implementation.
• They are not stored alongside the code.
A theory manual is a decidedly necessary and important piece of the documentation
menagerie for research software. However, integrating additional documentation into
the software development workflow can break the problem into more manageable
tasks, allow the documentation to evolve along with the code base, and illuminate
implementation decisions.
The theory manual, as its title might suggest, describes the theory, but rarely
describes the implementation.
User and Developer Guides
Similar to theory manuals, user guides often accompany mature research software.
These documents address implementation details and provide instructions for
use of the software. Unless generated automatically, however, they also represent sig‐
nificant effort on the part of the developers and are typically updated only when the
developers release a new version of the code.
Readme Files
In many code projects, a plain-text file sits among the source code files. With a name
like “readme,” it hopes not to be ignored. In most projects, the file is located in the
top-level directory and contains all the necessary information for installing, getting
started with, and understanding the accompanying code. In other projects, however,
a readme file might live in every directory or might be accompanied by other files
with more specific goals, like:
• install
• citation
• license
• release
• about
However, readme files are very common, especially in projects where users or devel‐
opers are likely to install the code from source. Since readme files are as unique as the
developers who write them, their contents are not standardized. The following is an example:
SQUIRREL, version 1.2 released on 2026-09-20
# About
The Spectral Q and U Imaging Radiation Replicating Experimental Library
(SQUIRREL) is a library for replicating radiation sources with spectral details
and Q and U polarizations of superman bubblegum.
# Installation
The SQUIRREL library relies on other libraries:
- The ACORN library www.acorn.nutz
- The TREEBRANCH database format API
Install those before installing the SQUIRREL library. To install the SQUIRREL
library:
./configure --prefix=/install/path
make
make install
...
Rather than being archived in the university library, in a journal article, or in a
printed, bound copy on the shelf of the lead developer, the readme lives alongside the
code. It is therefore more easily discoverable by individuals browsing the source code
on their own. GitHub, in a nod to the ubiquity of the readme file, renders each
readme file on the landing page of the directory containing it.
However, a readme is only one plain-text file, so it can only reasonably hope to com‐
municate the very bare minimum of information about the code base. Techniques
that improve readme files include markup formats, installation instructions, minimal
examples, and references to additional information.
Comments
A comment is a line in code that is not run or compiled. It is merely there for the
benefit of the reader, to help with interpreting code. Comments, ideally, assist us
when we face code written by other people or, often, our past selves. As discussed in
previous chapters, code comments are denoted syntactically by special characters and
are not read when the code is executed.
Code commenting syntax provides a mechanism for inserting metainformation
intended to be read by human eyes. In Python, comments can be denoted by a few
different special characters. The # precedes comments that occupy one line or less.
For longer comments and docstrings, triple quotes or apostrophes are used:
def the_function(var):
    """This is a docstring, where a function definition might live"""
    a = 1 + var  # this is a simple comment
    return a
However, comments can also pollute code with unnecessary cruft, as in the following
example:
def decay(index, database):
    # first, retrieve the decay constants from the database
    mylist = database.decay_constants()
    # next, try to access an element of the list
    try:
        d = mylist[index]  # gets decay constant at index in the list
    # if the index doesn't exist
    except IndexError:
        # throw an informative error message
        raise Exception("value not found in the list")
    return d
In this way, it is decidedly possible to over-document code with clutter. Comments
should never simply repeat what the code is doing. Code, written cleanly, will have its
own voice.
Nearly all of the comments in the previous example are unnecessary. It is obvious, for
example, that database.decay_constants() retrieves decay constants from the database object. Due to good variable naming, the comment adds nothing extra.
Indeed, the need for most comments can be reduced with intelligent naming decisions. For example, if the variable d in the preceding example were instead called decay_constant or lambda_i (after lambda, the standard mathematical symbol for the decay constant, which is itself a reserved keyword in Python and so cannot be used directly as a name), the purpose of that line of code would be clear even without the comment. A better version of this function might be:
def decay(index, database):
    lambdas = database.decay_constants()
    try:
        lambda_i = lambdas[index]
        # gets decay constant at index in the list
    except IndexError:
        raise Exception("value not found in the list")
    return lambda_i
Finally, comments can get out of date if they are not updated along with the code.
Even though they’re immediately adjacent to the code they describe, they’re easy to
miss when fixing a bug on the fly. For example, imagine that a change is made else‐
where in the code base such that the database.decay_constants() function starts to
return a dictionary, rather than a list.
The keys are all the same as the previous indices, so this doesn’t cause a problem for
the decay function. It still passes all but one of the tests: the one that checks the excep‐
tion behavior. That test fails because an IndexError is no longer raised for the wrong
index. Instead, because the dictionary analogy to IndexError is KeyError, what is
raised is a KeyError. This is not caught by the except clause, and the test fails.
To fix this problem, the developer changes the caught exception to the more general
LookupError, which includes both IndexErrors and KeyErrors:
def decay(index, database):
    lambdas = database.decay_constants()
    try:
        lambda_i = lambdas[index]
        # gets decay constant at index in the list
    except LookupError:
        raise Exception("value not found in the decay constants object")
    return lambda_i
However, when making the change, the developer may never have laid eyes on any
other line in this function. So, the comment has remained and states that lambdas is a
list. For new users of the code, the comment will lead them to believe that the
decay_constants object is a list.
How would you fix this code? Perhaps the whole function is better off without the
comment entirely. Can you think of anything else that should be changed in this
example? The answers to both of these questions can be found in the concept of self-documenting code.
Self-Documenting Code
The only documentation that is compiled and tested for accuracy along with the code
is the code.
In the exceptional book Clean Code, Robert C. Martin discusses many best practices
for self-documenting code. Most of his principles of clean, self-documenting code
revolve around the principle that the code should be understandable and should
speak for itself. Transparently written, clean code, after all, hides bugs poorly and
frightens away fewer developers. We’ll look at a few of those best practices here.
Naming
Chief among best practices is naming, which has already been covered somewhat. A
variable, class, or function name, Martin says:
…should answer all the big questions. It should tell you why it exists, what it does, and
how it is used. If a name requires a comment, then the name does not reveal its intent.
In the previous example, among other things that should be changed, the decay()
function should probably be renamed to decay_constant(). For more clarity, one
might consider get_decay_constant() or get_lambda() so that the user can guess
that it actually returns the value.
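A minimal sketch of such a rename, reusing the hypothetical database interface from the earlier example:

def get_decay_constant(index, database):
    """Return the decay constant stored at the given index."""
    decay_constants = database.decay_constants()
    try:
        return decay_constants[index]
    except LookupError:
        raise Exception("value not found in the decay constants object")

The name now states what comes back, so a call site reads naturally and no comment is needed.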
Simple functions
As has been mentioned previously, especially in Chapter 18, functions must be small
in order to be understandable and testable. In addition to this, they should do only
one thing. This rule helps code readability and usability enormously. When a function has no hidden side effects, the DRY (don't repeat yourself) principle can be applied confidently.
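As a minimal, hypothetical sketch of this principle (the function and file names are invented for illustration):

# Does two things at once: reads a file *and* computes a statistic.
def load_and_average(filename):
    with open(filename) as f:
        values = [float(line) for line in f]
    return sum(values) / len(values)

# Each function does exactly one thing, so each can be tested and reused alone.
def load_values(filename):
    with open(filename) as f:
        return [float(line) for line in f]

def average(values):
    return sum(values) / len(values)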
Consistent style
Finally, a key feature in readability is rich syntactic meaning. Programming languages
derive their vast power from the density of meaning in their syntax. However, any
language can be made rich beyond its defined parameters by use of consistent, stand‐
ardized style.
When variable and function names are chosen with a particular syntactic style, they
will speak volumes to the trained eye. Every language has at least one commonly used
style guide that establishes a standard. In Python, that style guide is PEP8.
In addition to dictating the proper number of spaces of indentation in Python code,
PEP8 also suggests variable and function naming conventions that inform the devel‐
oper of the intended purpose and use of those variables and functions. In particular:
# packages and modules are short and lowercase
packages
modules

# other objects can be long
ClassesUseCamelCase
ExceptionsAreClassesToo
functions_use_snake_case
CONSTANTS_USE_ALL_CAPS

# variable scope is *suggested* by style convention
_single_leading_underscore       # internal to module
single_trailing_underscore_      # avoids conflicts with Python keywords
__double_leading_and_trailing__  # these are magic, like __init__
The syntactic richness demonstrated here increases the information per character of
code and, accordingly, its power.
Docstrings
As discussed in Chapter 5, Python documentation relies on docstrings within func‐
tions. As a reminder, a docstring is placed immediately after the function declaration
and is the first unassigned string literal. It must occur before any other operations in
the function body. To span multiple lines, docstrings are usually enclosed by three
pairs of double quotes:
def function_name(arguments):
    """A docstring that
    spans multiple lines.
    """
Docstrings should be descriptive and concise. They provide an incredibly handy way
to convey the intended use of the functions to users. In the docstring, it is often useful
to explain the arguments of a function, its behavior, and how you intend it to be used.
The docstring itself is available at runtime via Python’s built-in help() function and is
displayed via IPython’s ? magic command. The Python automated documentation
framework, Sphinx, also captures docstrings. A docstring could be added to the
power() function as follows:
def power(base, x):
    """Computes base^x. Both base and x should be integers,
    floats, or another numeric type.
    """
    return base**x
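As noted above, that docstring then becomes available at runtime. For example, assuming the power() function has been defined as shown:

help(power)           # prints the docstring written above
print(power.__doc__)  # the raw docstring is also stored on the function object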
In addition to giving your audience the gift of informative type definitions and vari‐
able names, it is often useful to explain a class, its purpose, and its intended contents
in a comment near its declaration. Python does this using docstrings as well:
class Isotope(object):
    """A class defining the data and behaviors of a radionuclide.
    """
Further documentation about Python docstrings can be found in PEP257. Addition‐
ally, docstrings are an excellent example of comments that can be structured for use
with automated documentation generators. For more on their importance in the use
of Sphinx, read on.
Automation
While taking the time to add comments to code can be tedious, it pays off hand‐
somely when coupled with an automated documentation generation system. That is,
if comments are constructed properly, they can be read and interpreted, in the con‐
text of the code, to generate clickable, interactive documentation for publication on
the Internet.
Tools for automatically creating documentation exist for every language. Table 19-1
shows a few of the most popular offerings. In Java, it’s Javadoc; for C and C++, a com‐
mon tool is Doxygen. For Python, the standard documentation generator is Sphinx.
Table 19-1. Automated documentation frameworks
Name      Description
Doxygen   Supports marked-up comments, created for C++
Javadoc   Supports marked-up comments, created for Java
Pandoc    Supports Markdown, reStructuredText, LaTeX, HTML, and others
Sphinx    Standard Python system; supports reStructuredText
With these tools, well-formed comments in the code are detected and converted into
navigable API documentation. For an example of the kind of documentation this can
create, browse the documentation for the Python language (version 3). In keeping
with our focus on Python, we’ll look at Sphinx here.
Sphinx
Sphinx was created to automate the generation of the online Python 3 API documen‐
tation. It is capable of creating theory manuals, user guides, and API documentation
in HTML, LaTeX, ePub, and many other formats. It does this by relying on reStructuredText files that define the content. With an extension called "autodoc," Sphinx is
also capable of using the docstrings in source code to generate an API-documenting
final product.
Sphinx is a documentation system primarily for documenting Python code. This sec‐
tion will simply detail getting started with Sphinx and the autodoc extension. For a
more detailed tutorial on Sphinx, see the Sphinx documentation.
Getting started
Sphinx is packaged along with any scientific Python distribution (like Anaconda or
Canopy). The tool itself provides a “quickstart” capability. This section will cover how
to use that quickstart capability to build a simple website with *.rst files and the com‐
ments in source code.
As an example, we’ll use the object code used to demonstrate classes in Chapter 6.
Documentation for this code can be generated in a few simple steps. First, enter the
directory containing the source code and create a directory to contain the documen‐
tation:
~ $ cd book-code/obj
~/book-code/obj $ mkdir doc
Next, enter the doc directory and execute the Sphinx quickstart utility:
~/book-code/obj $ cd doc
~/book-code/obj/doc $ sphinx-quickstart
This utility is customized by answers from the user, so be ready to answer a few ques‐
tions and provide some details about your project. If unsure about a question, just
accept the default answer. To prepare for automatic documentation generation, be
sure to answer “yes” to the question about autodoc (“autodoc: automatically insert
docstrings from modules (y/n)”).
This step allows the documentation’s arrangement to be customized carefully. It will
create a few new files and directories. Typically, these include:
• A source directory for holding .rst files, which can be used to hold user guides
and theory manual content or to import documentation from the code package
• A makefile that can be used to generate the final product (by executing make
html, in this case)
• A build directory to hold the final product (in this case, .html files comprising the
documentation website)
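For example, after the source files have been edited (as described next), the final product mentioned above can be generated from inside the doc directory. This is a sketch of that step; the exact name of the build output directory depends on the answers given to quickstart:

~/book-code/obj/doc $ make html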
Once the quickstart step has been completed, you can modify the files in the source
directory and add to them in order to create the desired structure of the website. The
source directory will include at least:
• A conf.py file, which can be used to customize the documentation and define
much of the metadata for your project
• An index.rst file, which will be the landing page of the website and can be cus‐
tomized to define the structure of its table of contents
The documentation in the build directory is based on files in the source directory. To
include documentation for a particular module, such as particle.py, you can create a
corresponding .rst file (particle.rst) that invokes autodoc on that class. The index.rst
file must also be modified to include that file. In the end, our index.rst file should look
like:
.. particles documentation master file, created by
   sphinx-quickstart on Sun Jan 1 23:59:59 2999.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to particles's documentation!
=====================================

Contents:

.. toctree::
   :maxdepth: 2

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

API
====

.. toctree::
   :maxdepth: 1

   particle
And the particle.rst file should look like:
.. _particles_particle:

Particle -- :mod:`particles.particle`
=====================================
.. currentmodule:: particles.particle

.. automodule:: particles.particle

All functionality may be found in the ``particle`` package::

    from particles import particle

The information below represents the complete specification of the classes in
the particle module.

Particle Class
**************

.. autoclass:: Particle
Now, Sphinx has been informed that there is a particle.py module in the particles
package, and within that module is a Particle class that has docstrings to be
included in the documentation. This will work best if the docstrings are well formed.
Read on to find out more about how to format your docstrings for Sphinx.
Comment style
You can get more functionality out of Sphinx by formatting your docstrings in a syn‐
tax that it can parse easily. While Sphinx will often pick up a plain docstring adjacent to a function declaration even if it contains no special markup, you can control more of its behavior with
specific notation. A reference for this notation is on the Sphinx website, but to give
you an idea, here is an example of the Sphinx syntax for documenting a function:
.. function:: spin(self, s)

   Set the spin of the particle to the value, s.
You can also add more detail, with specific syntax. With the help of this syntax,
Sphinx can interpret parts of the comment that are intended to illuminate the param‐
eters or the return value, for instance. In this case the function might have a comment
like:
.. function:: spin(self, s)

   Set the spin of the particle to the value, s.

   :param s: the new spin value
   :type s: integer or float
   :rtype: None
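When the autodoc extension is used, the same field syntax can be placed directly in the docstring of the method in the source code, so that Sphinx finds it there. This is a sketch only; it assumes a spin-setting method on the Particle class from the earlier example:

class Particle(object):
    """A class defining the data and behaviors of a particle."""

    def spin(self, s):
        """Set the spin of the particle to the value, s.

        :param s: the new spin value
        :type s: integer or float
        :rtype: None
        """
        self.s = s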
Now, armed with this syntax, take some time with a Python code base of your own.
Go back and make appropriate changes to the comments in that code in order to pro‐
vide Sphinx-style syntax for some of the key functions, classes, and variables. Then,
try running sphinx-quickstart and the Sphinx autodoc extension to generate docu‐
mentation accordingly.
Documentation Wrap-up
In this chapter, you have learned how to use comments to communicate the meaning
and purpose of your code to future users unfamiliar with its implementation. Proper
documentation will have an enormous impact on the usability and reusability of your
software, both by others and your future self. Additionally, you have learned how to
automate the generation of interactive and comprehensive API documentation based
on appropriately styled comments.
Equipped with these skills, you can distribute code to your colleagues and it will serve
them as more than just a black box. With proper API documentation, your code
becomes a legitimate research product. Of course, even though code is one of the
most useful kinds of modern research product, recognition is hard to gain without
journal publication as well. For help becoming more effective at publication, proceed
to the next chapter.
CHAPTER 20
Publication
In science one tries to tell people, in such a way as to be understood by everyone,
something that no one ever knew before. But in poetry, it’s the exact opposite.
—Paul Dirac
One day, I’ll find the right words, and they will be simple.
—Jack Kerouac
Publishing is an integral part of science. Indeed, the quality, frequency, and impact of
publication records make or break a career in the physical sciences. Publication can
take up an enormous fraction of time in a scientific career. However, with the right
tools and workflow, a scientist can reduce the effort spent on mundane details (e.g.,
formatting, reference management, merging changes from coauthors) in order to
spend more time on the important parts (e.g., literature review, data analysis, writing
quality). This chapter will emphasize tools that allow the scientist to more efficiently
accomplish and automate the former in order to focus on the latter. It will cover:
• An overview of document processing paradigms
• Employing text editors in document editing
• Markup languages for text-based document processing
• Managing references and automating bibliography creation
The first of these topics will be an overview of two competing paradigms in docu‐
ment processing.
Document Processing
Once upon a time, Homo sapiens etched our thoughts into stone tablets, on papyrus,
and in graphical document processing software. All of these early tools share a com‐
monality. They present the author with a “What You See Is What You Get” (WYSI‐
WYG) paradigm in which formatting and content are inextricably combined. While
this paradigm is ideal for artistic texts and documents with simple content, it can dis‐
tract from the content in a scientific context and can make formatting changes a non‐
trivial task. Additionally, the binary format of most WYSIWYG documents increases
the difficulty of version control, merging changes, and collaborating on a document.
Common document processing programs include:
• Microsoft Word
• Google Docs
• Open Office
• Libre Office
Though these tools have earned an important seat at the table in a business environ‐
ment, they lack two key features that assist in efficient, reproducible paper-writing
workflows. The first shortcoming is that they fail to separate the content (text and
pictures) from the formatting (fonts, margins, etc.).
Separation of Content from Formatting
In the context of producing a journal publication, formatting issues are purely a dis‐
traction. In a WYSIWYG word processor, the act of choosing a title is polluted by the
need to consider font, placement, and spacing. In this way, WYSIWYG editors fail to
separate content from formatting. They thus prevent the author from focusing on
word choice, clarity of explanation, and logical flow.
Furthermore, since each journal requires submissions to follow unique formatting
guidelines, an efficient author avoids formatting concerns until a journal home is
chosen. The wise author separates the content from the formatting since that choice
may even be reevaluated after a rejection.
For all of these reasons, this chapter recommends a What You See Is What You Mean
(WYSIWYM) document processing system for scientific work. Most such systems are
plain text–based and rely on markup languages. Some common systems include:
• LaTeX
• DocBook
• AsciiDoc
• Pandoc
Among these, this chapter recommends the LaTeX context. Due to its powerful inter‐
face, beautiful mathematical typesetting, and overwhelming popularity in the physical
sciences, LaTeX is a fundamental skill for the effective researcher in the physical sci‐
ences.
Plain-text WYSIWYM tools such as these cleanly separate formatting from content.
In a LaTeX document, layout specifications and formatting choices can be placed in a
completely separate plain-text file than the ones in which the actual content of the
paper is written. Because of this clean separation, switching from a document layout
required by journal A to the layout required by journal B is done by switching the
style files accordingly. The content of the paper is unaffected.
Additionally, this clean separation can enable efficient reference management. In the
LaTeX context, this chapter will cover how this is achieved with bibliography files.
One more reason we recommend WYSIWYM editors over WYSIWYG document-processing tools is related to reproducibility: they facilitate tracking changes.
Tracking Changes
At the advent of computing, all information was stored as plain text. Now, informa‐
tion is stored in many complex binary formats. These binary formats are difficult to
version control, since the differences between files rarely make logical sense without
binary decoding. Many WYSIWYG document processors rely on such binary formats
and are therefore difficult to version control in a helpful way.
Of course, an adept user of Microsoft Word will know that changes in that program
can be tracked using its internal proprietary track-changes tool. While this is a dra‐
matic improvement and enables concurrent efforts, more transparent and robust ver‐
sioning can be achieved with version-controlled plain-text files. Since Microsoft’s
model requires that one person maintain the master copy and conduct the merges
manually, concurrent editing by multiple parties can become untenable.
At this point in the book, you should have an appreciation of the effectiveness, effi‐
ciency, and provenance of version-controllable plain-text formats. Of course, for the
same reasons, we strongly recommend the use of a plain-text markup language for
document processing. Accordingly, we also recommend choosing a text editor appro‐
priate for editing plain-text markup.
Text Editors
The editor used to write code can also be used to write papers. As mentioned in
Chapter 1, text editors abound. Additionally, most text editors are very powerful.
Accordingly, it can be a challenge to become proficient in the many features of more
than one text editor. When new programmers seek to make an informed decision
about which text editor to learn and use, many well-meaning colleagues may try to
influence their choice.
However, these well-meaning colleagues should be largely ignored. A review of the
features of a few common text editors (e.g., vi, emacs, eclipse, nano) should sufficiently illuminate the strengths and drawbacks of each.
Another argument for the use of plain-text markup is exactly this
array of available text editors. That is, the universality of plain-text
formatting and the existence of an array of text editors allows each
collaborator to choose a preferred text editor and still contribute to
the paper. With WYSIWYG tools, on the other hand, proprietary formats require that everyone use the same tool.
Your efficiency with your chosen editor is more important than which text editor you
choose. Most have a basic subset of tools (or available plug-ins) to accomplish:
• Syntax highlighting
• Text expansion
• Multiple file buffers
• Side-by-side file editing
• In-editor shell execution
Technical writing in a text editor allows the distractions of formatting to be separated
from the content. To achieve this, the elements of a document are simply “marked up”
with the special text syntax of a markup language.
Markup Languages
Markup languages provide syntax to annotate plain-text documents with structural
information. A build step then produces a final document by combining that textual
content and structural information with separate files defining styles and formatting.
Most markup languages can produce multiple types of document (i.e., letters, articles,
presentations) in many output file formats (.pdf, .html).
The ubiquitous HyperText Markup Language (HTML) may provide a familiar exam‐
ple of this process. Plain-text HTML files define title, heading, paragraph, and link
elements of a web page. Layouts, fonts, and colors are separately defined in CSS files.
In this way, web designers can focus on style (CSS) while the website owner can focus
on the content (HTML).
This chapter will focus on the LaTeX markup language because it is the standard for
publication-quality documents in the physical sciences. However, it is not the only
available markup language. A few notable markup languages include:
• LaTeX
• Markdown
• reStructuredText
• MathML
• OpenMath
Markdown and reStructuredText both provide a simple, clean, readable syntax and
can generate output in many formats. Python and GitHub users will encounter both
formats, as reStructuredText is the standard for Python documentation and Mark‐
down is the default markup language on GitHub. Each has syntax for including and
rendering snippets of LaTeX. MathML and its counterpart OpenMath are optional
substitutes for LaTeX, but lack its powerful extensions and wide adoption.
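To give a small taste of those syntaxes, the same hypothetical section heading could be marked up in either language. In Markdown:

    # Analysis Methods

    Results are *emphasized* like this.

And in reStructuredText:

    Analysis Methods
    ================

    Results are *emphasized* like this.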
In markup languages, the term markup refers to the syntax for denoting the structure
of the content. Content structure, distinct from formatting, enriches plain-text con‐
tent with meaning. Directives, syntactically defined by the markup language, denote
or mark up titles, headings, and sections, for example. Similarly, special characters
mark up document elements such as tables, links, and other special content. Finally,
rather than working in a single huge document, most markup languages enable con‐
structing a document from many subfiles. In this way, complex file types, like images,
can remain separate from the textual content. To include an image, the author simply
references the image file by providing its location in the filesystem. In this way, the
figures in a paper can remain in their native place on the filesystem and in their origi‐
nal file format. They are only pulled into the final document during the build step.
The build step is governed by the processing tool. For HTML, the tool is your
browser. For the LaTeX markup language, however, it is the LaTeX program. The next
section will delve deeper into LaTeX.
LaTeX
LaTeX (pronounced lay-tekh or lah-tekh) is the standard markup language in the
physical sciences. Built on top of the TeX typesetting language, LaTeX provides a
markup syntax customized for the creation of beautiful technical documents.
At a high level, a LaTeX document is made up of distinct constituent parts. The main
file is simply a text file with the .tex file extension. Other LaTeX-related files may
include style files (.sty), class files (.cls), and bibliography files (.bib). However, only
the .tex file is necessary. That file need only contain four lines in order to constitute a
valid LaTeX document. The first line chooses the type of document to create. This is
called the LaTeX document class.
LaTeX document class
The first required line defines the type of document that should result. Common
default options include article, book, and letter. The syntax is:
\documentclass{article}
This is a typical LaTeX command. It has the format:
\commandname[options]{argument}
The documentclass type should be listed in the curly braces. Options concerning the
paper format and the font can be specified in square brackets before the curly braces.
However, they are not necessary if the default styles are desired.
Note that many journals provide something called a class file and sometimes a style
file, which contain formatting commands that comply with their requirements. The
class file fully defines a LaTeX document class. So, for example, the journal publisher
Elsevier provides an elsarticle document class. In order to convert any article into
an Elsevier-compatible format, simply download the elsarticle.cls file to the directory
containing the .tex files, and change the documentclass command argument to elsarticle. The rest of the document can stay the same.
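For instance, with the elsarticle.cls file sitting next to the .tex file, only the first line of the document changes (a sketch):

% elsarticle.cls must be in the same directory (or on the LaTeX search path)
\documentclass{elsarticle}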
The next two necessary lines are the commands that begin and end the document
environment.
LaTeX environments
LaTeX environments are elements of a document. They can contain one another, like
Russian dolls, and are denoted with the syntax:
\begin{environment} ... \end{environment}
\begin{environment} and \end{environment} are the commands that indicate envi‐
ronments in LaTeX. The top-level environment is the document environment. The
document class, packages used, new command definitions, and other metadata
appear before the document environment begins. This section is called the preamble.
Everything after the document environment ends is ignored. For this reason, the
\begin{document} command and the \end{document} command must each appear
exactly once:
\documentclass{article}
\begin{document}
\end{document}
Since all actual content of the document appears within the document environment,
between the \begin{document} and \end{document} commands, the shortest possi‐
ble valid LaTeX file will include just one more line, a line of content!
\documentclass{article}
\begin{document}
Hello World!
\end{document}
This is a completely valid LaTeX document. Note that no information about fonts,
document layout, margins, page numbers, or any other formatting details need clut‐
ter this document for it to be valid. However, it is only plain text right now. To render
this text as a PDF, we must build the document.
Building the document
If the preceding content is placed into a document—say, hello.tex—a PDF document
can be generated with two commands. The first runs the latex program, which com‐
piles and renders a .dvi file. The second converts the .dvi file to the portable docu‐
ment format .pdf:
$ latex hello.tex
$ dvipdf hello.dvi
LaTeX uses the .tex file to create a .dvi file.
The .dvi file can be directly converted to .pdf with dvipdf.
Alternatively, if pdflatex is installed on your computer, that command can be used to
accomplish both steps at once:
$ pdflatex hello.tex
As shown in Figure 20-1, the document is complete and contains only the text “Hello
World!”
Figure 20-1. Hello World!
Now that the simplest possible document has been created with LaTeX, this chapter
can move on to using LaTeX to produce publication-quality scientific documents.
The first step will be to show how to appropriately mark up metadata elements of the
document, such as the author names and title.
LaTeX metadata
Document metadata, such as the title of the document and the name of the author,
may appear in many places in the document, depending on the layout. To make these
special metadata variables available to the whole document, we define them in a
scope outside of the document environment. The preamble holds information that
can help to define the document; it typically includes, at the very minimum, a
\title{} and \author{}, but can include other information as well.
When Ada Lovelace, often cited as history’s first computer programmer, began to
write the first mathematical algorithm, she wrote it in impeccable Victorian hand‐
writing on reams of paper before it was typeset and reprinted by a printing press.
This algorithm appeared in the appendices of a detailed technical description of its
intended computer, Charles Babbage’s Analytical Engine. The document itself, clev‐
erly crafted in the span of nine months, contained nearly all the common features of a
modern article in the physical sciences. It was full of mathematical proofs, tables, logi‐
cal symbols, and diagrams. Had she had LaTeX at her disposal at the time, Ada might
have written the document in LaTeX. She would have begun the document with
metadata in the preamble as seen here:
% notes.tex
\documentclass[11pt]{article}
\author{Ada Augusta, Countess of Lovelace}
\title{Notes By the Translator Upon the Memoir: Sketch of the Analytical Engine
Invented by Charles Babbage}
\date{October, 1842}
\begin{document}
\maketitle
\end{document}
In LaTeX, comments are preceded by a percent symbol.
Ada would like to create an article-type document in 11pt font.
She provides her formal name as the author metadata.
She provides the full title.
Another piece of optional metadata is the date.
The document environment begins.
The \maketitle command is executed. It uses the metadata to make a title.
The document environment ends.
Figure 20-2. A Title in LaTeX
Ada’s name, as well as the title of the article, should be defined in the preamble. How‐
ever, they are only rendered into a main heading in the document with the use of the
\maketitle command, which takes no arguments and must be executed within the
document environment. The document that is produced appears in Figure 20-2.
Exercise: Create a Document with Metadata
1. Create the notes.tex file in the previous code listing.
2. Run latex notes.tex and dvipdf notes.dvi to create a .pdf.
3. View it.
4. Remove the value for the date so that it reads \date{}.
5. Repeat steps 2 and 3. What changed?
Now that the basics are clear, scientific information can be added to this document.
In support of that, the document will need some underlying structure, such as sec‐
tions and subsections. The next section will show how LaTeX markup can be used to
demarcate those structural elements of the document.
Document structure
In the body of the document, the document structure is denoted by commands
declaring the titles of each structural element. In the article document class, these
include sections, subsections, subsubsections, and paragraphs. In a book, the struc‐
ture includes parts and chapters as well. Ada’s foundational notes were lettered A
through G. The body of her document, therefore, would have included one \section
command for each section:
% notes.tex
\documentclass[11pt]{article}
\author{Ada Augusta, Countess of Lovelace}
\title{Notes By the Translator Upon the Memoir: Sketch of the Analytical Engine
Invented by Charles Babbage}
\date{October, 1842}
\begin{document}
\maketitle
\section{Note A}
\section{Note B}
\section{Note C}
\section{Note D}
\section{Note E}
\section{Note F}
\section{Note G}
\end{document}
Since each note is a separate entity, however, it may be wise for Ada to keep them in
separate files to simplify editing. In LaTeX, rather than keeping all the sections in one
big file, Ada can include other LaTeX files in the master file. If the content of Note A,
for example, is held in its own intro.tex file, then Ada can include it with the \input{}
command. In this way, sections can be moved around during the editing process with
ease. Additionally, the content is then stored in files named according to meaning
rather than document order:
\section{Note A}
\input{intro}
\section{Note B}
\input{storehouse}
...
\section{Note G}
\input{conclusion}
Any text and LaTeX syntax in intro.tex will be inserted by LaTeX at the line where the
command appeared. This multiple-file-inclusion paradigm is very powerful and
encourages the reuse of document subparts. For example, the text that acknowledges
your grant funding can stay in just one file and can be simply input into each paper.
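As a hypothetical illustration of that reuse, a file named acknowledgments.tex holding the grant-funding text could be pulled into any paper with a single line:

% acknowledgments.tex is reused, unchanged, across papers
\section{Acknowledgments}
\input{acknowledgments}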
Now that the document has a structure, we can get to work filling in the text and
equations that make up the content of the paper. That will utilize the most important
capability in LaTeX: typesetting math.
Typesetting mathematical formulae
LaTeX’s support for mathematical typesetting is unquestionably the most important
among its features. LaTeX syntax for typesetting mathematical formulae has set the
standard for technical documents. Publication-quality mathematical formulae must
include beautifully rendered Greek and Latin letters as well as an enormous array of
logical, mathematical symbols. Beyond the typical symbols, LaTeX possesses an enor‐
mous library of esoteric ones.
Some equations must be rendered inline with paragraph text, while others should be
displayed on a separate line. Furthermore, some must be aligned with one another,
possess an identifying equation number, or incorporate interleaved text, among other
requirements. LaTeX handles all of these situations and more.
To render math symbols or equations inline with a sentence, LaTeX math mode can
be denoted with a simple pair of dollar signs ($). Thus, the LaTeX syntax shown here
is rendered as in Figure 20-3:
The particular function whose integral the Difference Engine was constructed to
tabulate, is $\Delta^7u_x=0$. The purpose which that engine has been specially
intended and adapted to fulfil, is the computation of nautical and astronomical
tables. The integral of $\Delta^7u_x=0$ being
$u_z =
a+bx+cx^2+dx^3+ex^4+fx^5+gx^6$, the constants a, b, c, &c. are represented on the
seven columns of discs, of which the engine consists.
Note the dollar signs denoting the beginning and end of each inline mathematical
equation. In an equation, mathematical markup can be used. Symbols, like the capital
Greek letter delta, are denoted with a backslash. The caret (^) indicates a following
superscript, and an underscore (_) means subscript.
Figure 20-3. Inline equations
Alternatively, to display one or more equations on a line separated from the text, an
equation-focused LaTeX environment is used:
In fact the engine may be described as being the material expression
of any indefinite function of any degree of generality and complexity,
such as for instance,
\begin{equation}
F(x, y, z, \log x, \sin y, x^p),
\end{equation}
which is, it will be observed, a function of all other possible
functions of any number of quantities.
An equation environment denotes an equation separated from the text, nicely
centered.
In this environment, mathematical markup can be used.
The equation is thereby drawn out of the text and is automatically given an equation
number, as in Figure 20-4.
Figure 20-4. The equation environment
LaTeX enables a multitude of such mathematical typesetting conventions common to
publications in the physical sciences. For example, multiple equations can be beauti‐
fully aligned with one another using the align math environment and ampersands (&)
to mark the point of alignment. The American Mathematical Society made this possi‐
ble by creating a package that it has made available to LaTeX users. To use this align‐
ing environment, Ada will have to load the appropriate package when running
LaTeX. That is done in the preamble.
Packages Extend LaTeX Capabilities
In addition to metadata, the preamble often declares the inclusion of any packages
that the document relies on. Standard packages include amsmath, amsfonts, and amssymb. These are the American Mathematical Society packages for math layouts, math
fonts, and math symbols, respectively. Another common package is graphicx, which
allows the inclusion of .eps figures.
The align environment is available in the amsmath package, so if Ada wants to use it,
she must include that package in her preamble. To enable an extended library of sym‐
bols, she might also include the amssymb package. Finally, since the description of the
Bernoulli algorithm for the Analytical Engine required enormous, detailed tables
spanning many pages, Ada might have also wanted to use the longtable package,
which enables flexible tables that break across pages. Here are the lines she’ll need to
add to her preamble:
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{longtable}
If she has added the amsmath package to her preamble, Ada can thus make beautifully
aligned equations with the align environment, as in the snippet shown here (rendered
in Figure 20-5):
The following is a more complicated example of the manner in which the
engine would compute a trigonometrical function containing variables.
To multiply
\begin{align}
&A+A_1 \cos \theta + A_2\cos 2\theta + A_3\cos 3\theta + \cdots
\intertext{by}
&B + B_1 \cos \theta.
\end{align}
The ampersand marks the place in this equation that should line up with the next
equation.
The progression of mathematical operations can be documented with common
interleaved phrases such as “where,” “such that,” or “which reduces to.” To inter‐
leave such text in a math environment, the \intertext command is used.
The ampersand in the second equation marks the place in this equation that lines
up with the ampersand of the first equation.
Figure 20-5. Aligned LaTeX
As you can see, the equations line up just where the ampersands were placed, but the
ampersands do not appear in the rendered document. Of course, this is only a taste of
mathematical typesetting capabilities in LaTeX, and equations are only half the battle.
How does LaTeX handle other elements of technical documents, such as tables and
figures?
Tables and figures
Tables and figures also often belong in their own files. In addition to the simplicity
gained by keeping such elements outside of the text, reusing these elements in other
documents becomes much simpler if they are kept in their own files. LaTeX is capable
(with the beamer package) of generating presentation-style documents, and these files
can be reused in those documents with a simple reference.
By keeping the figures themselves out of the main text file, the author can focus on
the elements of the figures that are related to the flow of the document: placement
relative to the text, captions, relative size, etc.
In Ada’s notes, diagrams of variables related to the algorithm were inserted. These
could be created in a LaTeX math environment, but they could also be included as
figures. The syntax for including an image is:
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=0.5\textwidth]{var_diagram}
\end{center}
\caption{Any set of columns on which numbers are inscribed, represents
merely a general function of the several quantities, until the special
function have been impressed by means of the Operation and
Variable-cards.}
\label{fig:var_diagram}
\end{figure}
The figure environment begins. Placement options are specified as (h)ere, (t)op,
(b)ottom, or on its own (p)age.
The figure should appear horizontally centered on the page.
The name of the image file is indicated and the width option specifies that the
figure should be half the width of the text.
A verbose caption is added.
A label is added so that the figure can be referenced later in the document using
this name tag.
The result of this syntax is shown in Figure 20-6. In it, the image is brought into the
document, numbered, sized, and captioned exactly as was meant.
Figure 20-6. Labels in LaTeX
In this example, the figure was labeled with the \label command so that it can be
referenced later in the document. This clean, customizable syntax for internal refer‐
ences is a feature of LaTeX that improves efficiency greatly. The next section will
show how such internal references work.
Internal references
The LaTeX syntax for referencing document elements such as equations, tables, fig‐
ures, and sections entirely eliminates the overhead of matching equation and section
numbers with their in-text references. The \ref{} command can be embedded in the
text to refer to these elements elsewhere in the document if the elements have been
labeled with the corresponding \label{} command.
At build time, LaTeX numbers the document elements and expands inline references
accordingly. Since tables, figures, and sections are often reordered during the editing
process, referring to them by meaningful labels is much more efficient than trying to
keep track of meaningless numbers. In Note D (example.tex), for instance, Ada
presents a complex example and refers to Note B (storehouse.tex). Since the ordering
of the notes might not have been finalized at the time of writing, referring to Note B
by a meaningful name rather than a number—or, in this case, a letter—is preferred.
To do this, Ada must use the \label{} command in the storehouse.tex file so that the
example.tex file may refer to it with the \ref{} command:
% storehouse.tex
\label{sec:storehouse}
That portion of the Analytical Engine here alluded to is called the
storehouse. . .
The section on the storehouse is stored in a file called storehouse.tex.
That section is called Note B, but Ada remembers it as the storehouse section,
and labels it such.
In this example, the label uses the prefix sec: to indicate that storehouse is a section.
This is not necessary. However, it is a common and useful convention. Similarly, fig‐
ures are prepended with fig:, tables with tab:, and so on. Ada can now reference
Note B from Note D as shown here:
% example.tex
We have represented the solution of these two equations below, with
every detail, in a diagram similar to those used in Note
\ref{sec:storehouse}; ...
Note D is held in example.tex.
Ada can reference Note B within this file using the memorable label.
When the document is built with these two files, a “B” will appear automatically
where the reference is. The same can be achieved with figure and table labels as well.
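The same \ref{}/\label{} pairing works for the figure shown earlier. A hypothetical sentence in the text might read:

The variable columns are diagrammed in Figure~\ref{fig:var_diagram}.

At build time, LaTeX replaces the reference with the figure's number (the ~ is a nonbreaking space that keeps "Figure" and its number on the same line).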
Now it is clear how to reference figures and sections. However, there is another kind
of reference common in publications. Bibliographic references (citations) are also
automated, but are handled a bit differently in LaTeX. The next section will explain
how.
Bibliographies
An even more powerful referencing feature of LaTeX is its syntax for citation of bib‐
liographic references and its automated formatting of bibliographies. Using BibTeX
or BibLaTeX, bibliography management in LaTeX begins with .bib files. These contain
information about resources cited in the text sufficient to construct a bibliography.
Had Ada desired to cite the Scientific Memoir her notes were concerned with, she
might have defined that work in a refs.bib file as follows:
% refs.bib
@article{menabrea_sketch_1842,
series = {Scientific Memoirs},
title = {Sketch of The Analytical Engine Invented by Charles Babbage},
volume = {3},
journal = {Taylor's Scientific Memoirs},
author = {Menabrea, L.F.},
month = oct,
year = {1842},
pages = {666--731}
}
To cite this work in the body of her text and generate an associated bibliography, Ada
must do only three things. First, she uses the \cite{} command, along with the key
(menabrea_sketch_1842), where she wants the reference to appear:
% intro.tex
...
These cards contain within themselves (in a manner explained in the Memoir
itself \cite{menabrea_sketch_1842}) the law of development of the particular
function that may be under consideration, and they compel the mechanism to act
accordingly in a certain corresponding order.
...
Second, she must include a command placing the bibliography. Bibliographies appear
at the end of a document, so just before the \end{document} command in her main
notes.tex file, Ada adds two lines:
% notes.tex
...
\section{Note G}
\input{conclusion}
\bibliographystyle{plain}
\bibliography{refs}
\end{document}
The first of these lines defines the bibliography style. The choices for this parameter are myriad. The simplest choice is often "plain," as has been used here, but a one-word change can alter the formatting to comply with Chicago, MLA, or any other bibliography formatting style. The second line names the location(s) of the .bib file(s).
The final necessary step is to build the bibliography along with the document. For
this, an extra build step is required that employs the bibtex command. In a peculiar‐
ity of LaTeX, for the references to appear, you must call latex again twice after issu‐
ing the bibtex command. So, at the command line, Ada must type:
$ latex notes
$ bibtex notes
$ latex notes
$ latex notes
$ dvipdf notes
The result is marvelous. In the text, the \cite command is replaced with “[1]”, and
on the final page of her document, a bibliography appears as in Figure 20-7.
Figure 20-7. Automated bibliography generation
Never again need scientists concern themselves with the punctuation after a title in
an MLA-style bibliography—LaTeX has automated this. The only thing LaTeX does
not automate about bibliography creation is reading the papers and making the .bib
file itself. Thankfully, other tools exist to make that process more efficient. The next
section introduces these.
Reference management
To generate a .bib file easily, consider using a reference manager. Such a tool helps to
collect and organize bibliographic references. By helping the researcher automate the
collection of metadata about journal articles and other documents, as well as the pro‐
duction of .bib files, these tools eliminate the tedious task of typing names, titles, vol‐
ume numbers, and dates for each reference cited in a paper. It can all be completely
automated. A number of tools for this task exist, many of them free or open source. These include, among
others:
• BibDesk
• EndNote
• JabRef
• Mendeley
• RefWorks
• Zotero
Reference managers help researchers to organize their sources by storing the meta‐
data associated with them. That metadata can typically be exported as .bib files.
Citing Code and Data
One thing that BibTeX lacks is a metadata format appropriate for uniquely referenc‐
ing code or data, unless it has a digital object identifier (DOI) number associated with
it. For truly reproducible publication, you should cite the code and data that pro‐
duced the analysis using a DOI.
Each commit in your version-controlled code repository has a commit hash number
that distinguishes it uniquely from others. For unique identification in a library or
bookstore, this book has an ISBN. Analogously, data and software objects can be
identified in a persistent way with a DOI number.
It is possible to acquire a DOI for any piece of software using archival services on the
Internet. Some are even free and open source.
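As a sketch of what such a citation might look like in a .bib file (the @misc entry type and its fields are standard BibTeX; the project details and DOI here are invented for illustration):

@misc{squirrel_1_2,
  author = {The SQUIRREL developers},
  title  = {SQUIRREL, version 1.2},
  year   = {2026},
  doi    = {10.0000/hypothetical-archive.12345},
  note   = {Source code archive}
}

The doi field is recognized by BibLaTeX and many journal styles; plain BibTeX styles may ignore it, in which case the DOI can also be placed in the note field.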
The use of these reference managers is outside the scope of this chapter. Please go to
the individual tools’ websites to get started using them.
Publication Wrap-up
Publication is the currency of a scientific career. It is the traditional way in which sci‐
entific work is shared with and judged by our peers. For this reason, scientists spend a
lot of time producing publication-quality documents. This chapter has sought to pro‐
vide an overview of the tools available to aid you in this pursuit and to give an intro‐
duction to the most ubiquitous, LaTeX. Now that you have read this chapter, you
should know that:
• Markup languages separate formatting from content.
• Markup-based text documents are more version-controllable.
• Many markup languages exist, but LaTeX is a particularly powerful tool for scien‐
tific publication.
In the context of LaTeX, you should also know how to:
• Produce a simple document
• Give structure to that document
• Add mathematical equations inline
• Display mathematics
• Include figures
• Reference those figures
• Cite bibliographic references
• Automate the creation of a bibliography
With these skills, you are equipped to begin generating lovely publication-quality
documents. Many resources are available online to enrich what you have learned
here. Two favorites are:
• “The Not So Short Introduction to LaTeX”, by Tobias Oetiker et al.
• Tex-LaTeX Stack Exchange
IPython Notebook
As an aside, please note that another option for reproducible document creation that
was not mentioned in this chapter (because it is in a class of its own) is the IPython
notebook. IPython Notebook is a part of the IPython interpreter that has been used in
previous chapters. It is an interface for Python that can incorporate markup languages
and code into a reproducible document. With an interface very similar to that of a
Mathematica notebook, the IPython (soon, Jupyter) notebook combines plain text,
LaTeX, and other markup with code input and output cells. Since the IPython note‐
book displays tables and plots alongside the code that generated them, a document in
this format is especially reproducible.
For more on IPython, Jupyter, and working with the Notebook, see the IPython web‐
site.
Publication is an essential part of bringing your work to your peers. Another way,
however, is direct collaboration. The next chapter will demonstrate how GitHub can
make collaboration on papers and software far more efficient.
CHAPTER 21
Collaboration
It was once the case that collaboration involved letters being sent through the mail
from scientist to scientist.
Today, collaborations happen via email, conference calls, and journal articles. In addi‐
tion to these tools, web-based content and task management tools enable scientific
collaborations to be made effortlessly across continents, in myriad time zones, and
even between scientists who have never met. Indeed, some of the first enormous
modern collaborations in the physical sciences spurred the progenitors of the collab‐
oration tools that currently exist (not least of all, the Internet). In the context of com‐
putation, issue ticketing systems can be closely tied to version control systems and
become powerful tools for peer review.
This chapter will demonstrate how such tools expedite and add peer-review capabili‐
ties to collaborative research discussions, writing papers, and developing scientific
software. These ticket management systems provide a system for content manage‐
ment alongside version-controlled repositories. Sites like GitHub, Launchpad, and
Bitbucket, which provide content management for hosted version-controlled reposi‐
tories, are essential to modern collaboration.
Additionally, this chapter will describe the interface for pull requests that allows col‐
laborators to peer review code. Transparent archiving and opportunity for review do
for scientific software what the peer-reviewed journal system does for scientific
papers. Scientific code has historically gone unreviewed and unrecognized by the sci‐
entific community. However, thanks to these new tools, software is increasingly being
seen as a bona fide scientific research product in itself, not unlike a journal article.
Without the interfaces for peer review provided by sites like GitHub, this would never
be possible.
In Chapter 15, version control was called the “laboratory notebook” of scientific com‐
puting. In that paradigm, the tools described in this chapter allow scientists to share,
review, and collaborate on laboratory notebooks, both among themselves and with
the world.
Scientific collaboration via the post was riddled with inefficiencies, bottlenecks, and
obfuscation of provenance. If Lise Meitner, Niels Bohr, Fritz Strassmann, and Otto
Hahn had had this new kind of system when they were working on the research that
would yield the theory of nuclear fission, the process of discovery would have been
expedited enormously. In a modern version of their collaboration, their communica‐
tion would have been instantaneous and the provenance of ideas and effort would
have been more transparent to the scientific community. In such a version of their
collaboration, perhaps there might have even been sufficient provenance to guarantee
a Nobel Prize for Prof. Meitner alongside her colleagues Hahn and Strassmann. The
next section will discuss how an open source scientific computing project today could
rely on ticket managers.
Ticketing Systems
For any research project, computational or otherwise, a ticket management system
(sometimes called a content management system or issue tracker) can vastly simplify
collaboration. Web-based ticket management systems allow progress on a project to
be tracked and managed at the level of individual tasks by providing a web interface
for task assignments, updates, and completion reports.
Almost all web-based services for code repository hosting (e.g., GitHub, Bitbucket,
Launchpad) have an associated issue tracker. These provide methods for creating
issues or tickets associated with necessary tasks related to the repository. The resulting
dashboard is an annotated, dynamic system for to-do list curation, communication,
and peer review.
Tickets are, fundamentally, the first step in the workflow for new features, bug fixes,
and other needs within the code base. The next section will give an overview of the
workflow associated with such ticketing systems.
Workflow Overview
In the context of a collaborative research effort, community guidelines for using issue
trackers must arise organically in a way that appropriately reflects the structure and
culture of the collaboration. However, common workflow patterns have emerged in
the use of issue trackers in scientific computing projects that share a general
structure.
While many types of workflows exist, the workflow in Figure 21-1 is common for a
situation when a bug is found.
Figure 21-1. A bug resolution workflow
First, before reporting a bug, the user or developer must check the rest of the cur‐
rently open issues to determine whether it has already been reported. If the bug is not
yet known, a new issue can be created describing the bug, and the collaborators can
agree upon related goals and subtasks.
When researchers take responsibility for completing issues, those tickets are assigned
to them (or they can assign the tickets to themselves). As a researcher makes progress on the completion
of the task, comments and updates can be added to the ticket. If collaborators have
questions or comments about the progress of a task, the ensuing discussion can take
place directly on the ticket through the web interface. Finally, when a conclusion is
reached, code committed to the repository can be referenced in the discussion. A pull
request holding the new changes is typically submitted, referencing one or more
issues. The new code submitted by pull request can then be reviewed, tested on multi‐
ple platforms, and otherwise quality checked. When the new code satisfies collabora‐
tors, the issue is declared solved or closed. It is typically closed by the person who
opened it, a project leader, or the person who solved it.
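As an aside, many trackers can perform this last step automatically. GitHub, for instance, will close an issue when a commit or merged pull request references it with a keyword such as "closes" or "fixes." The commit message and issue number here are purely illustrative:
# referencing the ticket number closes it once this commit lands on the default branch
$ git commit -am "Replace the old theory with nuclear fission; closes #1"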
When a new feature is desired, a similar workflow is followed. However, the initial
steps can be quite different. Many open source projects have a notion of “enhance‐
ment proposals” that are necessary to initiate the process.
Of course, the first step in any of these workflows is to actually create the issue or
ticket.
Creating an Issue
Users and developers often find bugs in code. A bug found is better than a bug that
goes unnoticed, of course, because only known bugs can be fixed.
“Issues” on GitHub are tickets associated with a particular repository. Issues alert
code developers and users to a bug, feature request, or known failure. Primarily, issue
tickets exist to specifically designate a place for discussion and updates concerning
these topics.
A modern Otto Hahn, when faced with a peculiar result, could begin a discussion
with his colleagues in the hopes of solving the problem. Figure 21-2 shows the Git‐
Hub issue creation form through which Otto opens an issue and describes the
problem.
Figure 21-2. Hahn needs a theory
Core features of an issue
The most effective issues have a bit more information than the one in Figure 21-2,
however. In particular, issues typically answer a few key questions:
• What is needed? A description of the error or feature request.
• Why is it necessary? Sufficient information to replicate the error or need.
• What happens next? A proposal for a path forward.
• How will we know this issue is resolved? A clear end goal.
Without those features, it may be difficult for other collaborators to replicate the
issue, understand the need for a change, or move toward a solution. Furthermore, for
provenance, even more data about the issue itself can be helpful.
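Nothing requires that such a ticket be opened through the web form, either. GitHub also exposes issues through its REST API, so an issue with all of the above features can be created from the command line. The following is only a sketch using curl; the repository path, token variable, and issue text are illustrative rather than part of any real project:
$ curl -X POST -H "Authorization: token $GITHUB_TOKEN" \
       -d '{"title": "Barium found in neutron-irradiated uranium",
            "body": "What is needed, how to replicate it, and a proposed path forward go here."}' \
       https://api.github.com/repos/kaiserwilhelm/uranium_expmt/issues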
Issue metadata
In addition to this core information, metadata can be added to the issue that helps to
organize it and designate its place among others. Different web-based hosting plat‐
forms provide different features for defining tickets. Some of the neat features of Git‐
Hub issues include tags, user pings, cross-linking with the code base, and commit
hooks.
Tags, which are completely customizable for each project, can be used for categoriz‐
ing and differentiating groups of issues. Many tags may be used for a single issue, and
on GitHub, colors can be used creatively to help visually distinguish distinct topics.
Issue tags may be used to categorize the issue along many different axes. For instance,
they may indicate the level of importance, the degree of complexity, the type of issue,
the component of the code affected, or even the status of the issue.
Table 21-1 gives some examples of the kinds of tags you might apply.
Table 21-1. GitHub tagging examples

Importance        Difficulty    Type of issue    Code component    Issue status
Critical          Expert        Bug              Installation      New
High priority     Challenging   Feature          Input/output      In progress
Medium priority   Mild effort   Documentation    Core              In review
Low priority      Beginner      Test             Visualization     Won't fix
The customizability and power of this metadata are vast. In particular, it can help
code developers to decide which issues to tackle next. Together with a tool like
HuBoard or waffle.io, this metadata can even fuel dashboards for managing projects
under sophisticated workflow paradigms (e.g., “agile” or “kanban” systems).
Since so much of this metadata revolves around helping developers to approach and
handle tasks, it should make sense that the most important type of metadata in an
issue ticket is the assigned developer.
Assigning an Issue
To avoid duplicated effort, an issue can be assigned to a particular developer. Com‐
monly, in the open source world, issues are discussed among developers as soon as
they are created. Often, the developer who most clearly has expertise in that area of
the code is the one who is assigned to handle it.
An issue can also be left unassigned to indicate that it is unclaimed. Developers who
become interested in solving such issues can confidently assign them to themselves.
Grouping Issues into Milestones
On GitHub, issues can be grouped into milestones. Milestones are groups of issues
defining broader-reaching goals. Milestones also have due dates. This feature of the
GitHub interface can be used as a mechanism for project goal tracking of many kinds,
including research group organization and code release management.
Grant-driven research, in particular, is well suited for milestone-based, due date–
driven work. Additionally, by bundling all of the issues needed for a desired feature
set, milestones are ideal for defining the necessary work remaining for a code release.
Even though she was far away from Otto Hahn, a modern Lise Meitner could have
commented on and assigned herself to handle the issue he opened. In Figure 21-3,
Lise makes a comment on the GitHub issue. By clicking the "assign yourself" link on
the right, she can claim the task.
Figure 21-3. Lise claims this task
Once she has claimed it, she can go ahead and start work on it. As she begins to
develop a theory, she may desire to bounce ideas off of her colleagues. Discussing an
issue can be done on GitHub as well, so she, her nephew Otto Frisch, and Niels Bohr
can discuss their thoughts right alongside the original question.
Discussing an Issue
To discuss an issue on GitHub, just enter a new comment into the comment box asso‐
ciated with the issue. The issue conversation is an appropriate place for:
• Asking and answering clarifying questions about the issue
• Sharing and discussing ideas for an approach
• Requesting and providing updates on progress
Some research groups are tempted to discuss issues via email rather than within the
issue tracker. While that strategy seems equivalent, it is not. Discussion directly on
the issue page is superior at retaining context, transparency, and provenance.
All that said, very open-ended discussions are typically more appropriate for the
email format. Issues are meant to eventually be closed.
Closing an Issue
When the bug is fixed or the new feature implementation is complete, the issue
should be closed. The collaboration’s cultural norms, expertise distribution, leader‐
ship hierarchies, and verification and validation requirements all affect the process by
which an issue is deemed complete.
For example, in repositories dedicated to writing a research paper (see Chapter 20),
building a research website, or prototyping a quick idea, changes might not require
strict quality assurance methods. In those situations, the issue may be closed without
much fanfare or oversight at all.
In contrast, the scientific community expects a high level of robustness and quality
from scientific software. To assure quality and accuracy, new changes to a scientific
software project may need to undergo verification, validation, and peer review. In
such a project, closing an issue may therefore involve the effort and consensus of
multiple researchers, incorporation of an automated test suite, adherence to a style
guide, and appropriate documentation.
Indeed, the level of validation and verification necessary in high-quality software
projects typically requires that the issue review culture includes a system of pull
requests.
Pull Requests and Code Reviews
Historically, software developers shared, removed, and submitted changes through
patches passed around via email. The pull request is a hyper-evolved descendant of
that technology and, indeed, carries a patch at its core. Pull requests, however, repre‐
sent an enormous leap forward for collaborative software development. Pull requests
are a reasonable, provenance-aware interface for applying peer review to proposed
patches.
Chapter 15 demonstrated the power of version control for tracking small changes.
Importantly, a patch is just such a small change. Recall from Chapter 1 that the differ‐
ence between two files can be output to the terminal. Additionally, recall that any out‐
put can be redirected to a file instead of the terminal. The resulting file represents the
difference between two files. It is called a patch because the patch command is used
to apply that difference to the original file (resulting in the modified file).
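For instance, a minimal round trip with hypothetical filenames might look like this:
# produce a unified diff of the two versions and save it as a patch file
$ diff -u theory_orig.txt theory_new.txt > newtheory.patch
# apply the patch to the original file, turning it into the new version
$ patch theory_orig.txt < newtheory.patch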
Submitting a Pull Request
With the pull-request interface, however, a researcher can submit a change for review
in a clean interface that links to actual commits, allows line comments, and persists
alongside the code on the GitHub servers.
In Lise Meitner’s case, perhaps the project repository might have held a text docu‐
ment outlining the working theory and experimental description for the project. To
make changes, Lise first forked the main repository, which lives under the kaiserwilhelm
username, to her own account and then cloned her fork locally:
$ git clone git@github.com:lisemeitner/uranium_expmt
She might have solved the issue that Otto Hahn opened by creating a branch (see
“Listing, Creating, and Deleting Branches (git branch)” on page 365), “newtheory,”
and editing the text file there:
$ git checkout -b newtheory
At this point, she might choose to delete some of the text that incorrectly described
the theory, and to add lines that outline her theory of fission. After editing the file
and committing her changes, she can push that branch up to her fork on GitHub (see
“Downloading a Repository (git clone)” on page 375). In the directory containing her
local copy of the repository, Lise might perform the following to push her feature
branch up to her fork:
$ git commit -am "edits the old theory and replaces it with the new theory."
$ git push origin newtheory
Once she has pushed the branch up to GitHub, Lise can navigate within a web
browser to the dashboard of her repository. There, GitHub provides the option to
make a pull request to the master branch of the main kaiserwilhelm repository. When
that button is clicked, the pull request appears as a new issue in the kaiserwilhelm
repository, where it should be reviewed by collaborators before being merged into the
code base.
Reviewing a Pull Request
Reviewing a pull request is much like reviewing a paper. More accurately, it should be
like reviewing a section or paragraph of a paper. Humans are better at reviewing short
paragraphs of code than hundreds of lines at once—more than that is too much to hold in
our heads.1 For this reason, developers should avoid lengthy or complex pull
requests if possible. By addressing changes in an atomistic fashion (one bug fix or feature
addition at a time), developers reduce the likelihood of introducing a bug that
can be missed in the review stage.

1 See Edward R. Tufte's The Visual Display of Quantitative Information (Graphics Press).
At this stage, developers reviewing the pull request may ask a number of questions.
Does the code:
• Accomplish the goals?
• Introduce bugs?
• Include sufficient tests?
• Follow the style guide?
• Pass the existing tests?
• Pass new tests?
• Pass the tests on other platforms (Unix, Windows)?
Merging a Pull Request
Once reviewed, the code can be merged. This can be done in one of two ways. On
GitHub, within the pull request itself, there is a green button for merging noncon‐
flicting pull requests.
Alternatively, via the command line, a developer can use a combination of git
remote, git fetch, git merge, and git push. Review Chapter 16 to recall how these
commands are used.
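For example, a maintainer of the main repository might merge Lise's reviewed branch by hand as follows. The remote name is arbitrary, and the fork URL matches the one used earlier in this chapter:
# start from an up-to-date master branch of the main repository
$ git checkout master
# add Lise's fork as a remote and download her branches
$ git remote add lisemeitner git@github.com:lisemeitner/uranium_expmt
$ git fetch lisemeitner
# merge the reviewed feature branch and publish the result
$ git merge lisemeitner/newtheory
$ git push origin master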
Collaboration Wrap-up
Collaboration can be a very complex, time-consuming element of scientific work—
especially with old technology. However, readers of this book should now be
equipped to collaborate more efficiently using the power of Git and GitHub. In this
chapter, you have seen how to create, assign, discuss, and tag issues, as well as how to
generate solutions, make pull requests, review code, and incorporate changes effi‐
ciently online.
This efficiency should free up time for determining which license is best for distributing
your code. For help with that, keep reading into the next chapter.
CHAPTER 22
Licenses, Ownership, and Copyright
For any software project, the most important file in the project is the license. This file
states who owns the work, who is allowed to use the project and under what condi‐
tions, and what rights and guarantees are conferred to both the users and the owners.
If a license file is not present, it is conspicuous in its absence. Since the license is the
most important file in a project, this chapter is the most important one in this book to
read and fully understand.
License files should be easy and obvious to find. Most of the time they appear in the
top-level directory and go by the name LICENSE, license.txt, or another variant. Note
that sometimes different parts of a project are provided under different licenses.
Some projects also have the dubious practice of being licensed differently depending
on how they are used. Be sure to read and understand the license of any software
project that you use before you use it.
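In practice, this check is quick: list the top level of the source tree and read whatever license files appear. The filenames below are only the common patterns; projects vary:
# look for the usual license filenames in the project root
$ ls LICENSE* COPYING* license*
# read whichever file turns up (here, assuming one named LICENSE)
$ less LICENSE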
Get a Lawyer
This chapter is not legal counsel. We are not qualified to help you
in a formal dispute. For that, you need to have a lawyer.
This chapter is only intended to provide friendly advice that aims to help you under‐
stand the basic concepts in publishing a creative work. Having a good grasp of these
fundamentals will help you make informed decisions. If for any reason you do end up
needing legal counsel in this area but do not know where to start, you can contact the
Electronic Frontier Foundation (EFF), the Software Freedom Conservancy (SFC),
Creative Commons (CC), the Free Software Foundation (FSF), or NumFOCUS (NF);
they may be able to help. Each of these organizations has experience with the legal
aspects of software development and should be able to point you in the right direc‐
tion, at the very least.
The licenses discussed in detail in this chapter will mostly be open source licenses.
This is because without peer review of software and data, scientific code cannot fairly
be called reproducible. If something is not reproducible, then it is de facto not sci‐
ence, no matter how deeply it covers a scientific topic. Equal dissemination of knowl‐
edge is critical to the scientific method. This is not to diminish the technical prowess
of closed source code at all—proprietary software is frequently among the most
sophisticated. It is just not science in the benefit-for-all-of-humanity-for-all-of-time
way. Open source licenses are ideally suited to research software.
This chapter will cover ideas and terms that you are probably already familiar with in
their common usage. Here, we seek to improve upon that lay understanding to fur‐
ther the aims of computational physics.
What Is Copyrightable?
Before we talk about licenses in depth, it is important to understand what they cover.
In western jurisprudence, from which most copyright law around the world stems,
ideas and concepts are not copyrightable. However, expressions of ideas are
copyrightable.
For instance, copyright does not apply to physical laws of nature and mathematical
facts. The number pi and the idea that it is the ratio between the area of a circle and
the square of its radius is not something that any human can claim ownership of.
Humans discovered this knowledge, but humans did not create it nor invent it. Pi just
is. Now, if I were to bake a pie with the letter pi cooked into the crust and the digits
proudly displayed around the perimeter and then took a picture of my handiwork, I
would have copyright over the picture. This tasty expression of pi would be uniquely
my own. Anyone claiming otherwise would be wrong.
The same logic applies even outside the world of strictly scientific endeavors. For
example, game rules are abstract concepts that are not copyrightable. However, any
published version of a game that you read is a copyrighted expression of those rules.
The rules for chess, Go, mancala, poker, bridge, basketball, rugby, cricket, The Settlers
of Catan, and Dungeons & Dragons are all not copyrightable. They are just ideas. That
said, the rule book that comes with any of these games is a creative and particular
expression of the rules and is subject to copyright law.
In software, the breakdown between what is and is not copyrightable is the distinc‐
tion between the implementation and the interface. The application programming
interface (API) is considered to be a set of ideas and therefore not copyrightable. The
actual implementation of an API, or how the actual work is performed, is copyrighta‐
ble. There are many possible implementations for any interface, and so any given
implementation is a unique expression. For example, say we wanted a function
named std() that computed the standard deviation of a list of values named vals.
These name choices and the concept of what the function is supposed to do make up
the interface, which is not copyrightable. That said, any code that computes the stan‐
dard deviation with this interface is copyrighted. There is a fundamental distinction
both conceptually and legally between how one uses software and how that software
is written. Keep this distinction in mind as you read through the following sections.
Right of First Publication
Now that we know what can be copyrighted, we should understand when copyright
applies. Most copyright systems feature what is called the right of first publication.
This is the idea that copyright automatically goes to the first publisher of a creative
work. This right is conferred whether or not it is specified by the publisher at the
time. Such laws protect publishers from having their work stolen as long as they can
demonstrate that they were there first.
This has important implications in the information age, where self-publishing is nor‐
mal behavior. Anyone who writes a blog post, tweets out “Happy Birthday, Mom!” or
puts code on GitHub personally retains the copyright to that work via the right of
first publication. Your work is your own unless you give it up.
In software, say you post a piece of code without a license. By rights this code is yours
and yours alone, even though it is publicly visible. You own the copyright, and you
have not specified how others are allowed to use your code. By default, legally, they
are not entitled to use it at all. If your intent was to share your code for reproducibil‐
ity, provenance, education, or general scientific goodwill, by not having a license you
have undermined your intended purpose.
Software licenses are important because they allow you to retain copyright and they
state the terms by which other people or organizations are entitled to use or modify
your software. That said, it is possible to completely forego all rights.
What Is the Public Domain?
What happens if you do not want to deal with licenses or you do not want to retain
copyright? If you just want the code that you produce to be for the unrestricted bene‐
fit of all, yielding completely to the better nature of scientific discourse? It is possible
to put a work into the public domain (PD) with a simple statement along the lines of,
“This work has been placed in the public domain.”
The public domain is a concept that society as a whole “owns” a work, and it is there‐
fore free and available for anyone and everyone to use and modify. Since everyone
owns it, nobody owns it. In most cases, the copyright of an existing work will expire
after a set number of years (25, 50, 90, etc.), at which point the work will enter the
public domain. The public domain is what allows anyone to republish the collected
works of Mark Twain, Socrates’s Apology, or Mary Shelley’s Frankenstein. However,
just because copyright will expire naturally does not mean that you have to wait that
long. You are free to add your own works to the public domain sooner if you so
desire.
That said, the public domain is one of the trickiest parts of international copyright
law. Not every country has a public domain that is compatible with the notions
expressed here. In some countries it may not be possible for an individual to place a
work into the public domain prior to the expiration of copyright. It is important to
understand the laws of the country where you live. Wikipedia is often a good first
resource. For anything deeper, you should probably seek out legal counsel. If you are
at a university, national lab, or private company, your organization will often have
resources available for you to use.
If you do not want to put your software in the public domain, but do want it to be
free and open, you have to do the work of picking a license.
Choosing a Software License
A license is a legal document that states how software is allowed to be used by its
users and what rights the author retains, and serves to protect both the users and the
authors. Without such a document, only the original author or publisher has any
right to use or modify the code. Having a software license that accurately reflects your
needs and the needs of your potential users is extremely important.
A variety of licenses have been created over the years, tailored to different situations.
At the broadest level, there are proprietary licenses and free/open source licenses. Pro‐
prietary licenses are usually written by companies that sell software. The Microsoft
Windows End-User License Agreement (EULA) is an example of such a document.
They typically proclaim the owner of the copyright to be the company, disclaim dam‐
ages if the software is abused, and promise litigation in the event of piracy. They are
often handcrafted by a team of lawyers to minimize the exposure to the company.
Free and open source software licenses, sometimes abbreviated as FOSS, FLOSS, or
OSS, are much more relevant to computational physics software, especially if it is pri‐
marily for research. Research-grade software typically has the following attributes:
• It does not have immediate and direct commercial interest.
• It must have source code inspectable by peers for review.
• It changes rapidly to fit the needs of the researcher.
Licenses that give users and other developers the freedom to look at and modify code
encourage scientific discourse, education, comparison, quality assurance, and partici‐
pation in the project. Since most researchers do not have the funds available to hire
thousands of developers to do all of these activities, an open source license can help
establish a community of users and developers. Fellow scientists can pool together to
help in these essential tasks that make a software project successful. The exchange is
often nonmonetary. One person provides code, and others may help make that code
better. To be safe, open source licenses often explicitly include a no-warranty clause.
You cannot sue someone for damages because their open source code somehow
harmed your machine.
We will not cover proprietary licenses more here. There are a lot of them, since
almost every software product has its own license. On the other hand, the open
source world has converged on a much smaller number of licenses. However, there
are still far too many open source licenses to go into great depth on all of them. Here,
we will present only the most important and the most interesting ones. For a more
comprehensive review of open source licenses, please see the Open Source Initiative’s
(OSI) website or the GNU commentary page.
It is highly advisable to use an off-the-shelf open source license. Do not attempt to
write your own. Partly this is because you are not a lawyer, and it would be a waste of
your time. More importantly, licenses are not generally considered trustworthy until
they have been proven in court. This means that one party broke the terms of a
license, another party sued, and the license was upheld in a court of law. This is an
expensive and time-consuming process. Relatively few open source licenses have
gone through this crucible, but almost all of those that have, have survived the journey. Any
license you write will not have this benefit.
The choice of license can have a deep and lasting effect on the community that devel‐
ops around a code project. Given its importance, picking the right license receives
surprisingly little attention from developers. It is a core part of the social aspect of
software development. Everyone should know about the license and its implications,
not just the law nerds on a project. If you ever need help picking, the excellent Choo‐
seALicense.com will help you along your way.
Let’s examine some key licenses now to find out precisely what it means to be open
source.
Berkeley Software Distribution (BSD) License
The Berkeley Software Distribution or BSD license is actually a collection of three
possible licenses, known as the BSD 4-Clause, 3-Clause, and 2-Clause licenses.
Historically, the 4-clause is the oldest and the 2-clause is the most recent. The
4-clause is not recommended anymore, though both the 3- and 2-clause versions are
commonly used. Of all of the licenses that we will discuss, either the 3- or 2-clause
license is recommended for use in your software projects. These are the licenses best
tailored to science and research. Major projects such as NumPy, SciPy, IPython, and
the rest of the scientific Python ecosystem use one of these licenses. The texts of the
3-clause and 2-clause licenses, respectively, are as follows:
Copyright (c) <year>, <copyright holder>
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of the <organization> nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL <COPYRIGHT HOLDER> BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Copyright (c) <year>, <copyright holder>
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The views and conclusions contained in the software and documentation are those
of the authors and should not be interpreted as representing official policies,
either expressed or implied, of the Project.
The BSD licenses are known as permissive licenses. This is because they allow the fur‐
ther distribution of the code to be done under any license. Only the copyright notice
needs to be displayed. Additionally, further versions of the code need make no prom‐
ises as to what license they will be released under. Modifications to the code may be
released under any license that the author of the modification desires. Permissive
licenses give a lot of freedom to users and developers other than the original author,
while protecting the original author from liability.
For example, suppose person B copies person A’s code. A originally released the code
under a BSD license. B wants to modify the code and relicense the whole new code
base. B is completely free to do so, and does not even have to include a copy of the
original BSD license. The only requirement is to include the copyright notice, “Copy‐
right (c) <year>, Person A." This ensures that person A gets credit for the work, but
without a lot of headache.
The freedom to modify and relicense is a major reason why BSD is the recommended
license for scientific computing. It leaves the greatest number of options open to future
scientists. The MIT license is considered to be equivalent to the BSD 2-clause and is a
perfectly reasonable substitute. Up next is one of BSD’s main competitors.
GNU General Public License (GPL)
The GNU General Public License (GPL) is again a collection of three distinct licenses:
GPLv1, GPLv2, and GPLv3. Additionally, there are v2.1 and v3 GNU Lesser General
Public Licenses (LGPLs). These are compatible with the corresponding GPLs of the
same major version but are closer in spirit to BSD and MIT licenses. All of the GPL
options are promoted by the FSF for both GNU and non-GNU software projects.
GPLv1 is out of date and should not be used. There remains debate over whether v3 is
an improvement over v2.
Linux is almost certainly the largest project that uses GPLv2, and it will continue to
do so until the end of time. The GNU Compiler Collection (GCC) is likely the largest
project to use GPLv3. The texts of both GPLv2 and GPLv3 are too long to include
here. However, the following preamble should be added to the top of every GPLv3
file:
Copyright (C) <year> <name of author>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
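Because the notice has to live inside comments in each source file, adding it to an existing project is usually scripted. Here is a minimal sketch for a pure-Python code base, assuming the notice text sits in a file named gpl_header.txt; the filenames and paths are illustrative:
# comment out the notice for Python and prepend it to every .py file
$ for f in $(find . -name '*.py'); do
>     { sed 's/^/# /' gpl_header.txt; echo; cat "$f"; } > "$f".tmp && mv "$f".tmp "$f"
> done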
Unlike BSD, the GPL licenses are not permissive. They instead exemplify a class of
licenses known as copyleft by their punning proponents, or viral by their disinterested
detractors. In a copyleft license, modifications to the code must be licensed in the
same (or a similar) way as the original code. For open source code, this means that
any third party that forks your code also has to make the fork open source. In princi‐
ple, this is a good idea. Over time, it builds up an ecosystem of programs that are all
open source and work well with one another. In practice, however, sometimes this
requirement limits the freedom of the developers to the point where an improve‐
ment, modification, or fork will not even be written in the first place.
In general, GPL licenses are a good and reasonable choice for software projects. How‐
ever, the potential disincentives to contribution can be a barrier to entry for small and
medium-sized projects. Since physics programs—even when enormously popular—
are never large, GPL is not recommended for default use. There may still be situations
where it is the best choice, though. It’s important to always carefully consider your
options.
Another important concept for this suite of licenses is GPL compatibility. The FSF
defines compatibility as whether different portions of a code base can be released
under different licenses, where one license is the GPL. A license need not be copyleft
itself to be compatible with the GPL. Notably, the BSD 3- and 2-clause licenses are
both GPL compatible.
Even though the GPL may not be ideally suited to scientific endeavors, it is still a
wildly successful series of licenses that have been proven in court. In general, if you
want GPL-style copyleft but also want something less restrictive in terms of redistrib‐
ution, the LGPL offers a middle path that should be treated as a serious alternative.
Permissive and copyleft licenses are the main players in open source software. How‐
ever, there are also licenses that may be used for generic creative works and are not
restricted to software that may be appropriate.
Creative Commons (CC)
The Creative Commons (CC) license suite is an alternative that applies not only to
software but, more broadly, to all creative works. This includes poetry, prose, film,
audio, and more. The CC suite is now in its fourth instantiation, sometimes called v4.
The goal of CC licenses is to make the ideas of sharing and collaboration evidenced in
open source software more broadly applicable to other endeavors. For example, all
content on Wikipedia is licensed under CC. This has been a remarkably successful
suite of licenses in its own right.
Creative Commons licenses are all quite lengthy, so it is not possible to print them
here. Their length comes partly from the fact that the licenses are designed to keep
the same spirit in every country in the world. As you might imagine, this can make
the documents quite wordy. They are a tremendous feat of legal engineering.
The CC licenses are distinguished from one another through one or more of the fol‐
lowing four modifiers:
BY (Attribution)
Other people must credit you when they use your work. This applies to all CC
licenses.
SA (ShareAlike)
Other people must reshare the work under the same license if they use or modify
your work.
ND (NoDerivatives)
Other people may share your work but are not allowed to modify it in any way.
NC (NonCommercial)
The work may be modified and shared as long as the result is not sold
commercially.
The six licenses that Creative Commons supplies are thus known under the following
monikers: CC BY, CC BY-SA, CC BY-ND, CC BY-NC, CC BY-NC-SA, and CC BY-NC-ND. Naturally, ND and SA cannot be applied to the same license.
The CC BY license fills the same role as the BSD license. It has permissive terms that
allow modification of the work as long as credit is given where credit is due. The CC
BY-SA license fills the same role as the GPL or the LGPL: it adds copyleft provisions
to the attribution clause. Both of these are great choices for a scientific computing
project, depending on whether you need copyleft or not.
The ND and NC modifiers, on the other hand, are considered extremely harmful to
scientific software. They may have a place in other domains, such as the arts (notably,
the webcomic xkcd is released under a CC BY-NC license, which is why you do not
see any witty stick figures appearing in this book). However, for physics programs,
the inability to modify the software effectively renders the open source aspects use‐
less. This is true even with just commercial applications restricted. Such license terms
hamstring the development of vibrant science and software communities. Regard any
use of either ND or NC with extreme prejudice. Luckily, examples of their use are
very rare in computational science.
Lastly, Creative Commons supplies a public domain substitute called CC0 (pro‐
nounced see-see-zero). In many ways, licensing code as CC0 is better than placing it
in the public domain because of the consistency problems with the public domain.
CC0 applies everywhere, including in countries where analogous public domain laws
do not exist. Moreover, in countries with a robust public domain, the CC0 license
effectively reduces to the public domain. Of all of the licenses that are discussed here,
CC0 is the freest. No major code projects of the size of Linux use this license, though
it is slowly gaining traction.
In summary, CC BY, CC BY-SA, and CC0 are all reasonable choices for a scientific
software project. The others should be avoided due to their implications being anti‐
thetical to evolving a scientific community.
The next section discusses licenses that you should not use, but which highlight inter‐
esting aspects of open source software.
Other Licenses
There are many other licenses that we are not able to cover in depth. A large number
of these fall into the broad categories of either being permissive or copyleft, though
with their own unique wording. Some of these have been defended in court. Unlike
the previously discussed licenses, most of these were initially written to apply to a sin‐
gle code project.
For instance, the Apache License was written for the Apache web server and is sup‐
ported by the Apache Software Foundation. It has a lot of use outside of Apache itself.
Unlike other licenses, it contains clauses pertaining to patents.
The Python Software Foundation License is a license that applies to the CPython
interpreter and some other Python Software Foundation code. It occasionally sees
some use in other parts of the Python ecosystem but is not commonly used outside of
that. It is a permissive license that is GPL compatible.
On the more problematic side is the JSON License. This is easily one of the most
commonly broken and flagrantly abused licenses in the world. The full text of the
license reads:
Copyright (c) 2002 JSON.org
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without restriction,
including without limitation the rights to use, copy, modify, merge,
publish, distribute, sub-license, and/or sell copies of the Software,
and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
The Software shall be used for Good, not Evil.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
The provision for Good and against Evil is considered indefensible due to the subjec‐
tive nature of these terms. While the broad strokes of good and evil can be largely
understood and agreed upon, the specifics, even in a strictly utilitarian context, can
get murky quickly. How any of this applies to software at all is even less well under‐
stood. The JSON License brings into relief the supposition that other open source
software allows for “more evil.”
Even though the pursuit of evil is antithetical to most open source software develop‐
ers and scientists, explicitly disallowing it is not reasonable. It is an unenforceable
claim. Suppose someone really wanted to do bad things and felt the need to use JSON
in order to accomplish the nefarious tasks. Would not the first evil act simply be to
break the JSON license? It ends up being prescriptive, not preventative. Suffice it to
say that no one has ever been successfully prosecuted for breaking the “Good, not
Evil” clause. This license serves as a cautionary tale as to why you should never write
your own license—it probably will not turn out like you thought.
Lastly, the FLASH code from the University of Chicago has a particularly interesting
license. FLASH is a plasma and high-energy density physics program. FLASH is open
source, though it is not directly redistributable by the user. Furthermore, users need
to sign a document registering themselves with the Flash Center prior to being given
access to the code. Item 7 of the FLASH license is unique in its academic nature:
7. Use Feedback. The Center requests that all users of
the FLASH Code notify the Center about all publications that
incorporate results based on the use of the code, or modified
versions of the code or its components. All such information
can be sent to info@flash.uchicago.edu.
Here we see an attempt to gather statistics about the impact factor of scientific soft‐
ware being enforced through the license itself. Given the 850+ publications found on
the FLASH website, this strategy has been very successful. The other nonstandard
provisions of the FLASH license are all justified in their own right, but it is not rec‐
ommended that you adopt licenses such as the FLASH one (indeed, the authors are
not aware of another implementation of this kind of license). Your code is likely not
in a situation like that of the FLASH code. Using one of the stock licenses from the
previous sections is much easier for scientific programs. Uniqueness is not an asset
for software licenses.
Now that we have seen many license types, including permissive, copyleft, and more
exotic ones, it is reasonable to wonder how an existing project might transfer from
one license to another.
Changing the License
Occasionally developers will decide that it is time to change the license that their code
is distributed under. This can happen because of newfound legal implications in
the license, because the broader ecosystem has moved away from a license, or, more
commonly, because it is felt that the project will be easier to sustain under an alterna‐
tive license. One prominent and successful example of a computational physics
project that has undergone relicensing is yt; you can read about the campaign on the
project blog.
Even under the best conditions, changing a license is a long and painful process. It
has a high likelihood of failure even if all of the current developers are on board with
the decision.
Relicensing an open source project requires that you obtain consent from all previous
contributors to make the switch. Consent can be given either actively (sign a waiver,
send an email, etc.) or passively (if we do not hear from you in three months, we will
assume you do not object). If consent is not given by a former or current developer,
either the code that the developer worked on must remain with its original license, or
all of that person’s code must be rewritten to conform with the new license. Both of
these options are their own headaches.
If enough people do not agree with the relicensing activity, it becomes too much of a
hassle to actually relicense the code. In such cases it is easier to start from scratch
than to try to force the issue with the existing code base. The scorched-earth strategy
happens more frequently than many people would like to admit. Many of the worst
horror stories come from trying to switch between copyleft and permissive licenses,
or vice versa.
That said, relicensing is not impossible. It is just hard. When done successfully it will
take at least three or four months. The steps to proceed through a relicense are as
follows:
1. Privately and tentatively, present the idea of the new license to current core devel‐
opers. Make the case for the new license as a possibility and get a feel for the level
of support. Solicit feedback from this team first. If the current core team is not
interested, stop.
2. Publicly present the option for hypothetically changing the license based on the
previous feedback to all core developers. Solicit opinions from the developers
and gauge the interest. If all members of the core team are happy with the change
and most of the current developers are either for it or do not care, proceed.
Otherwise, stop.
3. Publicly present the option for changing the license to the users. If the project
will lose a significant minority of users due to the change, stop. Users are your
most precious resource, and changing the license is not worth the setback.
4. Promise to change the license only on the following and further releases. Do not
apply the relicense retroactively to previous releases.
5. Obtain written consent from all current and past developers that the relicense is
acceptable. An email suffices. Put a healthy time limit on the opportunity to sub‐
mit the consent form. This allows for the opportunity to raise objections. A
month is typically enough. Still allow the relicensing to fail at this point.
6. Relicense the code on the next release after waiting an appropriate amount of
time.
There are many places where even a single person who has not worked on the project
in a decade can block this entire process. It also is not a technical concern that deals
with how the code operates. There are no right or wrong answers. This is a recipe for
hurt feelings and battered egos. You should enter into such an effort treading as softly
and as slowly as possible.
Even with the license’s chief position in a software project, you also need to under‐
stand its limitations.
Copyright Is Not Everything
Copyright and licensing are not all that there is to what is broadly known as intellec‐
tual property. Patents cover temporary monopolies over ideas and processes. The
original idea behind patents was to allow creators, inventors, and innovators a brief
opportunity to bring their ideas to market. Patenting software has recently come
under intense scrutiny due to “patent trolls” who hoard patents but do not attempt to
produce anything but lawsuits. Trademarks also fall into the realm of intellectual
property. They are recognizable symbols of a business or organization that uniquely
identify that group in a specific domain. Trademarks must be continually used or they
lapse and may be taken up by another group. Some people believe that it’s incorrect to
use the umbrella term “intellectual property,” as copyright, patents, and trademarks
are all distinct legal areas. Each of these requires a lawyer that specializes in that area.
Still, for us plebs, the term has stuck.
Intellectual property is not the only instrument of control over software. In fact, most
intellectual property does not truly apply to computational physics. Other mecha‐
nisms are much more effective and more present in many physics software projects.
Export control is a particularly strong system. This is when the government steps in
and forbids the transfer of source code to any other country or foreign national
without explicit permission. Typically even citizens who have access must sign a
document stating that they promise not to violate the terms of the export restrictions.
Export control is particularly powerful because it applies not only to software, but
also to data and ideas. As an extreme and unreasonable example, the government
could restrict you from telling anyone the value of pi on the basis of export control.
When a program is export controlled, it cannot just be put on the Internet. The rules
for sharing become much more complicated. Violating export control typically comes
with a long jail sentence. It is important to take it very seriously. In general, this is not
something you ever want to deal with personally.
Physics programs are subject to export control far more frequently than other soft‐
ware. The typical reason that a physics program becomes export controlled is because
it solves a class of problems that make it easier to build various kinds of highly
restricted weaponry. Furthermore, some contain or produce data that is considered
sensitive or secret and therefore may not be shared. Mathematics and computer sci‐
ence programs are sometimes export controlled due to their applications to cryptog‐
raphy. Be aware of and abide by your government’s laws regarding export control.
Lastly, in the United States there is the Health Insurance Portability and Accountabil‐
ity Act of 1996, or HIPAA. Since this deals with medicine, it comes up only occasion‐
ally in physics software. Software that deals with human patients must properly anonymize
patient data to ensure the public's right to privacy. This is known as HIPAA compli‐
ance. The Department of Health and Human Services is responsible for overseeing
such compliance. Other countries have similar laws.
These alternative structures can be extraordinarily effective at limiting what can be
done with software. They are not always in the control of the author, either. While
patents and trademarks and other concerns in intellectual property may not apply,
HIPAA rules and privacy violations can be extraordinarily incriminating. And every
computational physicist needs to be cognizant of export control rules at all times.
Momentary lapses in good judgment are not allowed with respect to export control.
Licensing Wrap-up
You have now seen a wide swath of the legal issues that surround software develop‐
ment and how they apply to computational physics. You should be familiar with the
following ideas:
• The license document is the most important file in your project.
• Scientific software should be free and open source.
• The permissive BSD licenses are recommended for most computational science
projects.
• The copyleft GPL licenses are also reasonable choices.
• The public domain is a great alternative to licensing your code, but it does not
apply everywhere. Use CC0 as a substitute for the public domain.
• Creative Commons licenses can apply to more than just software.
• You should not write your own license.
• Relicensing a project can be very difficult.
• It is important to be aware of the export control laws in your country.
Now that you’ve made it to the end of the book, the next chapter dives into some
parting thoughts about computational physics.
CHAPTER 23
Further Musings on Computational Physics
At last, you have arrived! You are now ready to go forth into the wide world of com‐
putational physics. No matter where your specialty may take you, you now have the
skills, abilities, and understanding to perform and reproduce great feats of scientific
computing. For some of you, this book is all you will need to succeed. For others, this
is only the beginning.
Where to Go from Here
What is so beautiful about the skills that you have learned in this book is that they
empower you to go anywhere. Computational physics has taken us from deep within
the Earth’s crust to the farthest reaches of the universe, from pole to pole, all around
the world, and everything in between. Even asking “Where to?” can seem daunting.
The answer is that you should go where your interests lie. If you have a special part of
physics that you already call home, research what computational projects are out
there already. Then try a project out as a user. Join the mailing list. Ask the developers
if they need help with anything, and try contributing back. A good project will be
very welcoming to new users and contributors.
If there is nothing out there that does what you want, you don’t like the languages that
the existing projects are written in, or you don’t agree with their licenses, try starting
your own project—one that suits your needs. This is not scary, and with the existence
of repository hosting websites like GitHub, it has become very easy. Starting a new
project is a great way to hone your software architecture skills while also clarifying
what part of physics you find most interesting.
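As a rough illustration of how little ceremony this requires, here is a minimal sketch of bootstrapping a new project from the command line. The project name mycode and the GitHub URL are hypothetical placeholders; only standard git commands and the packaging layout discussed earlier in this book are assumed:

$ mkdir mycode && cd mycode          # create and enter the project directory
$ git init                           # put the project under version control from day one
$ touch README.md LICENSE setup.py   # a readme, a license file, and an install script
$ mkdir mycode tests                 # the Python package itself and a home for its tests
$ git add .
$ git commit -m "Initial project skeleton"
$ git remote add origin https://github.com/<username>/mycode.git   # hypothetical remote
$ git push -u origin master

From there, everything this book has covered about packaging, testing, documentation, and collaboration applies directly to your new repository.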
The following is a list of projects that might pique your interest. We have grouped
them according to subdomain. Most, but not all, of these have strong Python and
physics components. Here they are, in their own words:
• Astronomy and astrophysics
— yt: A Python package for analyzing and visualizing volumetric, multiresolution data
from astrophysical simulations, radio telescopes, and a burgeoning interdisciplinary
community.
— Astropy: A community effort to develop a single core package for astronomy
in Python and foster interoperability between Python astronomy packages.
— SunPy: The community-developed, free and open source solar data analysis
environment for Python.
• Geophysics, geography, and climate
— UV-CDAT: A powerful and complete frontend to a rich set of visual data
exploration and analysis capabilities well suited for climate data analysis
problems.
— Iris: A Python library for meteorology and climatology. The Iris library imple‐
ments a data model to create a data abstraction layer that isolates analysis and
visualization code from data format specifics.
— ArcPy: Python for ArcGIS.
— Shapely: A Python package for set-theoretic analysis and manipulation of pla‐
nar features using functions from the well-known and widely deployed GEOS
library.
• Nuclear engineering
— PyNE: A suite of tools to aid in computational nuclear science and engineer‐
ing. PyNE seeks to provide native implementations of common nuclear algo‐
rithms, as well as Python bindings and I/O support for other industry-standard nuclear codes.
— OpenMC: A Monte Carlo particle transport simulation code focused on neu‐
tron criticality calculations. It is capable of simulating 3D models based on
constructive solid geometry with second-order surfaces. The particle interac‐
tion data is based on ACE format cross sections, also used in the MCNP and
Serpent Monte Carlo codes.
— Cyclus: The next-generation agent-based nuclear fuel cycle simulator, provid‐
ing flexibility to users and developers through a dynamic resource exchange
solver and plug-in, user-developed agent framework.
• Physics
— QuTiP: Open source software for simulating the dynamics of open quantum
systems.
— Trackpy: A Python package providing tools for particle tracking.
• Mathematics
— FiPy: An object-oriented, partial differential equation (PDE) solver, written in
Python, based on a standard finite volume (FV) approach.
— SfePy: Software for solving systems of coupled partial differential equations
(PDEs) by the finite element method in 1D, 2D, and 3D.
— NLPy: A Python package for numerical optimization. It aims to provide a
toolbox for solving linear and nonlinear programming problems that is both
easy to use and extensible. It is applicable to problems that are smooth, have
no derivatives, or have integer data.
— NetworkX: A Python-language software package for the creation, manipula‐
tion, and study of the structure, dynamics, and functions of complex net‐
works.
— SymPy: A Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as
possible in order to be comprehensible and easily extensible.
— Sage: A free, open source mathematics software system licensed under the
GPL.
• Scientific Python
— IPython: A rich architecture for interactive computing.
— NumPy: The fundamental package for scientific computing with Python.
— SciPy: The SciPy library is one of the core packages that make up the SciPy
stack. It provides many user-friendly and efficient numerical routines such as
routines for numerical integration and optimization.
— Pandas: An open source, BSD-licensed library providing high-performance,
easy-to-use data structures and data analysis tools for the Python program‐
ming language.
— matplotlib: A Python 2D plotting library that produces publication-quality
figures in a variety of hardcopy formats and interactive environments across
platforms.
— PyTables: A package for managing hierarchical datasets, designed to effi‐
ciently and easily cope with extremely large amounts of data.
— h5py: A Pythonic interface to the HDF5 binary data format. It lets you store
huge amounts of numerical data, and easily manipulate that data from
NumPy.
— PyROOT: A Python extension module that allows the user to interact with
any ROOT class from the Python interpreter.
— lmfit: A high-level interface to non-linear optimization and curve fitting prob‐
lems for Python. Lmfit builds on the Levenberg-Marquardt algorithm of
scipy.optimize.leastsq(), but also supports most of the optimization
methods from scipy.optimize.
— scikit-image: A collection of algorithms for image processing.
• Enthought Tool Suite (ETS): A collection of components developed by
Enthought and its partners, which can be used every day to construct custom sci‐
entific applications.
— Mayavi: An application and library for interactive scientific data visualization
and 3D plotting in Python.
— ParaView: An open source, multiplatform data analysis and visualization
application. ParaView users can quickly build visualizations to analyze their
data using qualitative and quantitative techniques. The data exploration can
be done interactively in 3D or programmatically through ParaView’s batch
processing capabilities.
— VisIt: An open source, interactive, and scalable visualization, animation, and
analysis tool. From Unix, Windows, or Mac workstations, users can interac‐
tively visualize and analyze data ranging in scale from small desktop-sized
projects to large leadership-class computing facility simulation campaigns.
— Vispy: A high-performance interactive 2D/3D data visualization library. Vispy
leverages the computational power of modern graphics processing units
(GPUs) through the OpenGL library to display very large datasets.
— Cython: An optimizing static compiler for both the Python programming lan‐
guage and the extended Cython programming language. It makes writing C
extensions for Python as easy as Python itself.
• Pedagogy and community
— Software Carpentry: A nonprofit membership organization devoted to
improving basic computing skills among researchers in science, engineering,
medicine, and other disciplines. Its main goal is to teach researchers software
development skills to help them do more, in less time and with less pain.
— NumFOCUS: A nonprofit foundation that supports and promotes world-class, innovative, open source scientific software. NumFOCUS aims to ensure
that money is available to keep projects in the scientific Python stack funded
and available.
— Mozilla Science Lab: A Mozilla Foundation initiative that helps a global net‐
work of researchers, tool developers, librarians, and publishers collaborate to
further science on the Web.
— PyLadies: An international mentorship group with a focus on helping more
women become active participants and leaders in the Python open source
community.
— OpenHatch: A nonprofit dedicated to matching prospective free software
contributors with communities, tools, and education.
Now, go forth and make waves!
Glossary
absolute path
The full, unambiguous path from the root
directory at the top of the tree, all the way
to the file or directory being indicated.
API
Application programming interfaces are
the public-facing functions, methods, and
data with which users and developers
interact.
assertion
An assertion, in software, is an operation
that compares two values. If the assertion
returns a false Boolean value, a runtime
exception is thrown.
assignment
Assignment statements apply values to
variable names. Most modern languages
use = to denote assignment. Notably, R
uses left and right arrows (<- and ->).
attributes
Any variable may have attributes, or attrs,
which live on the variable. Attributes are
also sometimes known as members. They
may be accessed using the binary . opera‐
tor.
awk
A shell program for manipulation and
analysis of column-based text files.
bash
The Bourne Again SHell, which combines
features from previous shells and seeks to
provide a scripting language for interact‐
ing with an operating system through a
terminal emulator.
binary
An operator that takes exactly two vari‐
ables or expressions.
call stack
Also called the execution stack, this is the
list of function calls that have been made
but not completed. At a certain point of
execution, such as a crash, the call stack
makes up the traceback.
class
A class defines a collection of functions
and data. Classes also define constructors,
which describe how to create the objects
that those functions and data are associ‐
ated with.
command-line interface
The command-line interface, or CLI, pro‐
vides access to the shell. Commands can
be entered into the command-line prompt
to navigate the filesystem, run programs,
and manipulate files.
compilation
A step in building software when a com‐
piler converts source code into binary
machine code.
compiled language
A programming language in which source
code is first converted to binary machine
code and then executed. This two-step
process requires that all statements that
are to be executed be known before the
compilation step.
concatenate
A fancy word meaning "to append."
configuration
A step in building software or executing an analysis pipeline when platform-dependent variables are detected and used to customize a makefile or installation script.
container
A data structure that holds other variables. For example, a tuple, list, or dictionary.
continuous integration
A software development strategy in which new code is tested and built regularly. Usually, new code must pass all tests in order to be accepted, and cross-platform installation issues are checked once or more per day.
CPU-bound
Describes an operation that is limited by the speed of the processor.
csh
The C SHell, an early shell program, based on sh, for interacting with an operating system through a terminal emulator.
current working directory
When you're using the terminal, the current working directory is the one that you are in. That is, it is the location from which all commands are executed. It is output as the result of the pwd command.
docstring
A specific syntax for documentation in Python. It is the first unassigned string literal in a function body and is usually enclosed by three double quotes.
DRY
The "don't repeat yourself" principle states that any piece of code should be defined once and only once and have a single meaning. This promotes code reuse.
dynamic language
A programming language where variable types are not declared before they are used.
exception
In software, an exception is a way of alerting the user to runtime errors in code behavior. An exception can be thrown from some place in the code and, optionally, caught elsewhere in the code. If it is thrown but not caught before reaching global scope, it halts code execution and prints a (hopefully) informative error message.
executable
A runnable program.
general-purpose language
A programming language that is meant to cover a broad range of domains and computational problems.
global scope
The namespace of the current module.
glue language
A programming language that can easily interact with multiple other programming languages and their libraries.
grep
A command-line program for regular expression–based pattern matching.
high-level language
A programming language that provides useful, common abstractions over lower-level languages. Such languages are typically more concise and easier to use than their counterparts. Examples include Python, R, and MATLAB.
inheritance
When one class is a subclass of another, the subclass is said to be inheriting from the superclass. Indeed, data and behavioral attributes of the parent (super)class are passed down to the child (sub)class.
installation
The step in building software where executables, libraries, and include files associated with a program are put in an accessible place in the filesystem.
integration test
A type of test that exercises more than a few units (functions) in a code base. Integration tests often check that for simple inputs, the code arrives at a final expected answer as an ultimate output.
interpreted language
A programming language in which statements in the language are executed at runtime by sequentially being fed into a special interpreter loop. There is no need for a separate compile step in order to execute code because of the interpreter. High-level languages are often interpreted, though not always!
issue trackers
Also known as issue ticketing systems or bug trackers, these systems help to streamline a project. Most often, they provide an interface for organizing the process of identifying, describing, collaborating on, and solving software bugs and new features.
ksh
The Korn SHell is an early shell program that is backward-compatible with sh, but extends its ability to interact with an operating system through a terminal emulator.
linking
The step in building software in which an executable is attached to an external library on which it depends.
Linux
An open source operating system kernel first developed by Linus Torvalds. There are many flavors of Linux operating systems available. The most popular of these is Ubuntu.
local scope
The namespace of the current function.
memory-bound
Describes an operation whose speed is limited by how much memory (typically RAM) there is available.
metacharacter
A character that has a special meaning aside from its literal meaning.
metadata
Data about data.
no-op
A no-operation statement, function, class, or other language construct that exists as a placeholder for when a value is required but nothing should be executed.
object
In object-oriented programming, objects are instances of a class. They are entities that possess data and methods specific to their class.
object orientation
A computational paradigm that associates data and method attributes to classes of objects.
redirection
When the output of one program is diverted into the input of another program or file. Typically, this relies on the > syntax.
regression test
A test that serves to guarantee the preservation of expected behavior through changes in the code base. A suite of unit tests can serve as regression tests when they are run after changes are made.
regular expressions
A language of metacharacters for pattern
matching.
relative path
This is a string describing the path from
the current directory to the file or direc‐
tory being indicated.
REPL
A read-eval-print loop is a standard mechanism for text-based interactive computing. Users type in commands that are read in by the interpreter and evaluated while the user waits, and the results are printed to the screen. This process repeats until the user closes the interpreter session.
root
This word has two meanings in Unix parlance. In the context of filesystems, the "root" directory is the one at the top of the directory tree, indicated by /. In the context of user permissions, "root" is the administrator, the top-level user of the machine.
scope
Scope defines what variables are available inside of different constructs in a programming language. Scoping rules determine how variables are looked up and vary from language to language.
sed
A command-line program for regular expression–based pattern matching and substitution.
sequence
A term used to describe any data structure
that imposes an integer ordering on its
values. This is roughly equivalent to the
mathematical construct by the same
name.
sh
The Bourne SHell, a venerable and popular Unix shell that first appeared in Version 7 Unix.
singleton
A singleton is a class that only has one
instance in memory throughout the life‐
time of a program. The term “singleton”
may also apply to the variable itself. Sin‐
gletons are not necessarily constants.
string
A string is a character or list of characters.
It is a data type appropriate for data such
as names and paths.
symbolic link
A filesystem object that points to another
filesystem object (the “target”).
tcsh
The TENEX C SHell. Like most shells, this
shell seeks to provide a scripting language
for interacting with an operating system
through a terminal emulator.
terminal emulator
A graphical user interface that allows the
user access to a text prompt and applica‐
tions such as the command-line interface.
ternary
An operator that takes exactly three vari‐
ables or expressions.
test framework
A system for collecting and running unit
tests. Examples include nose, jUnit, xUnit,
and GTest.
traceback
Also called the stack trace, backtrace, or
stack backtrace. The traceback is a report
of the active function calls at a certain
point during execution.
unary
An operator that takes only one variable
or expression.
unit test
A test that operates on a single unit of
code. Typically, functions and methods
are the atomic units of code that are tested
in this case.
Unix
Any computer operating system platform
that derives from the original AT&T Unix,
developed at the Bell Labs research center.
variable type
A variable’s type defines the internal prop‐
erties of the value, how it is stored, and
how other parts of a language may use the
variable. Variable types may be primitive
and built into the language or defined by
users and developers. A variable may be
checked to verify whether it “is a” certain
type. Certain types may be converted to
other types.
version control
Version control is a method by which a
repository is created for holding versions
of a set of files. Version control systems
include Git, Mercurial, SVN, CVS, and
more. These systems allow storage, recall,
and distribution of sets of files under
development. Plain-text files, such as code
and markup, are well suited to version
control.
Index
Symbols
" (double quote), 49, 54
""" (triple double quotes), 55
# (comment character), 41, 432
$ (dollar sign), 1
% (modulo operator), 53
' (single quote), 49, 54
''' (triple single quotes), 55
( ) (parentheses operator), 55
* (wildcard character), 20, 180
+ (addition operator), 53
, (comma), 70
-- (double minus sign), 23
-r (recursive flag), 19
-rf (recursive force flag), 19
. (dot), 4, 55, 62, 183
.. (double dot), 62
/ (forward slash), 3
<> (angle brackets), 8
= (assignment operator), 48
= (equals sign), 23, 42
== (equality operator), 79
? (question mark), 183
@ sign, 112, 139
[ ] (square brackets), 8, 66, 186
[[]] (double square brackets), 8
\ (backslash character), 13, 54, 185
\ (escape character), 13, 54, 184
\n (newline character), 54
\r (carriage return character), 54
\t (tab character), 54
_ (underscore character), 13, 42
__ (double underscore character), 120
__init__( ) method, 127, 141
__new__( ) method, 141
{} (curly braces), 56, 71, 186
~ (tilde character), 5
ǀ (pipe command), 25, 186
˃ (arrow character), 25
˃˃ (double arrow), 25
˃˃˃ (Python prompt), 40
A
abs( ) function, 121
absolute imports, 61
absolute path, 3, 34, 493
accumulate( ) method, 223
add( ) method, 73
addition operator (+), 53
algorithms
data size and, 259
implementing, 161
in- and out-of-core, 249
non-parallel, 285-290
parallel, 282
zipping, 252
alias command, 36
alternation, 186
Amdahl's law, 281
American Mathematical Society, 452
Anaconda, installing, xxiii
analysis and visualization
analysis
data-driven, 162
model-driven, 160
tools for, 159
cleaning and munging data, 155-159
data preparation
automated approach to, 148
experimental data, 149
metadata, 151
simulation data, 150
steps of, 145
importance of, 145
loading data
Blaze, 155
NumPy, 152
Pandas, 153
PyTables, 153
tools available, 151
visualization
best practices, 162, 176
Bokeh, 172
Gnuplot, 164
Inkscape, 174
matplotlib, 167-172
tool selection, 175
tools for, 164
analysis pipeline
automating creation of, 333
building/installing software
automation tools, 343
compilation, 345
dependency configuration, 345
makefile configuration, 343
overview of, 341
platform configuration, 344
system and user configuration, 344
compilation, 334
configuration, 333
execution, 334
installation, 334, 346
linking, 334
make utility
automatic updating by, 337
benefits of, 334
enabling, 337
makefiles, 337
running, 337
special targets, 340
target definition, 338
vs. bash scripts, 336
overview of tasks, 335
strategies for, 333
angle brackets (<>), 8
anonymous functions, 108
antipatterns, 142
APIs (application programming interfaces), 66,
472, 493
append( ) method, 67
approximations, 42
apropos command, 24
arange( ) function, 202, 221
arguments
adding, 22
keyword, 99
naming, 128
optional, 8
positional, 98
variable numbers of, 101
arithmetic operations, 211, 226, 280
array data languages, 201
array( ) function, 202
arrays
adding dimensions to, 214
Boolean arrays, 217
broadcasting, 212
comparison operators, 218
copying a slice, 210
creating, 202, 207
data types (dtypes), 204
fancy indexing, 215
fixed size of, 204
forcing data types in, 207
HDF5 data format, 240
immutability of dtypes, 221
manipulating attributes, 204
organizing with B-trees, 270
purpose of, 201
record arrays, 220
slicing and views, 208
structured arrays, 220
support for in programming languages, 201
temporary, 211
transforming, 223
arrow character (˃), 25
as keyword, 59
ASCII characters, 49
assertions, 405, 493
assignment, 42, 493
assignment operator (=), 48
astronomy and astrophysics projects, 487
at( ) method, 223
atof( ) function, 238
atoi( ) function, 238
atomic types, 238
Boolean values, 45
boolean variable type, 45
Bourne Shell (sh), 496
branches, 365
break statement, 87
broadcasting, 212
buckets, 262
build system automation tools, 343
attributes, 493
listing, 119
manipulating in arrays, 204
ndarray attributes, 203
string, 55
Automake/Autoconf, 343
automatic resizing, 261
awk, 493
adding action with, 196
benefits of, 195
example of, 195
overview of, 188
vs. sed and grep, 195
B
C
B-trees
array organization with, 270
best application of, 257, 269
libraries supporting, 271
rotation of, 270
rules for, 270
structure of, 269
vs. binary search trees, 270
backslash character (\), 13, 54, 185
backtrace (see traceback report)
Bash (Bourne Again SHell), 493
Bash scripts, 36, 335
Basic regular expressions (BRE), 182
.bashrc file, 33
Berkeley Software Distribution (BSD), 475
bibliographies, 456
BibTex/BibLaTex, 456
bidirectional/bijective hash maps, 263
Big Data, 229
big O notation, 259
bin directory, 3
binary operators, 46, 493
binary package managers, 311, 316
binary search trees, 270
(see also B-trees)
Binstar, 318
birthday paradox, 261
Bitbucket, 371
bitwise operators, 219
blank lines, deleting, 193
Blaze, 155
blist package, 271
Bokeh, 172
Boolean arrays, 217
call stacks, 395, 493
capturing parentheses, 192
carriage return character (\r), 54
cat command, 13
CC0 (Creative Commons Zero) license, 480
cd (change directory) command, 7
central processing unit (CPU), 235, 244
change mode (chmod) command, 29
change ownership (chown) command, 28
changes, tracking in publication process, 443
char (character) type, 49
character sets, 186
child threads, 291
chmod (change mode) command, 29
chown (change ownership) command, 28
chunking, 245
citations, 456
classes
attribute types associated with, 124
class files, 446
class inheritance, 133, 495
class keyword, 123
class variables, 124
constructors, 127
decorators, 139
defining, 123
definition of term, 493
duck typing, 133
function of, 117
instance variables, 126
main ideas of, 118
metaclasses, 140
methods, 129
polymorphism, 135-138
purpose of, 123
static methods, 132
CLI (command-line interface), 493
additional resources, 38
basics of, 1
computing environment, 31-36
getting help, 21-26
HDF5 utility commands, 254
managing files and directories, 11-21
metacharacters on, 179-187
navigating the shell, 1-11
permissions and sharing, 26-31
scripting with bash, 36
SSL (secure socket layer) protocol, 30
climate projects, 488
close( ) method, 234
cloud computing, 310, 325
CMake, 343
code
backing up online, 371
citing, 458
code reuse, 95
code reviews, 468
creating documentation for, 427-440
deploying, 309-328
direct collaboration, 461-470
legacy code, 427
legal issues surrounding, 471-484
scalability of, 281
scaling up, 281
self-documenting, 434
style guides for, 434
text editors for, 443
writing clean, 414, 434
code reuse, 104
collaboration
overview of, 461
pull requests, 468
ticketing systems
assigning issues, 466
benefits of, 462
closing issues, 468
discussing issues, 467
issue creation, 464
workflow, 462
collisions, 261
columns
manipulating data in, 195
working with in data frames, 268
comma (,), 70
comment character (#), 41, 432
commits, making, 374, 458
communicator objects, 301
community and pedagogy projects, 490
COMM_WORLD communicator, 302
comparison operators, 78, 218
compatibility, 478
compilation, 493
compiled languages, 39, 316, 334, 341, 494
complex patterns, 192
composable operators, 48
compound dtypes, 220
comprehensions, 90
compression, 252
computational physics
astronomy and astrophysics projects, 487
basic project steps, xix-xxi
benefits of regular expressions for, 178
geophysics, geography, and climate projects,
488
mathematics, 489
nuclear engineering projects, 488
pedagogy and community projects, 490
physics projects, 488
Python open source community, 491
Python operators for, 46
relationship of physics and computation,
xvii
scientific Python projects, 489
computational scale, 280
computer architecture, 235
computers, original use of term, 177
computing environment
configuring/customizing, 33
investigating with echo program, 31
nicknaming commands, 36
running programs, 34
saving variables, 33
computing systems
distributed computing, 283
high-performance, 283
high-throughput, 283, 327
simulated, 319
concatenation, 13, 53, 494
Conda package manager, xxiii, 316
conditionals
if-elif-else statements, 81
if-else statements, 80, 82
positive vs. negative, 81
syntax of, 77
configuration, 494
constants, scientific, 131
constructors, 127
containers, 494
deployment with, 321-325
lists, 66
purpose of, 65
sets, 71
tuples, 70
types of, 65
content management systems (see ticketing sys‐
tems)
contents, listing, 6
context management, 234
contiguous datasets, 245
continuous integration, 494
copyleft licenses, 478
copyright (see legal issues)
corner cases, 410
countdown( ) generator, 110
cp (copy) command, 17
cProfile, 396
CPU-bound, 249, 291, 494
CPython, 291
create_array( ) method, 240
create_carray( ) method, 247
create_group( ) method, 240
create_table( ) method, 240
Creative Commons (CC), 471, 478
cross references, creating, 29, 455
csh (C SHell), 494
CSV (comma separated values), 152
Ctrl-c command, 14
Ctrl-d command, 14
curly braces ({}), 56, 71, 186
current working directory, 4, 494
D
daemon threads, 291
dash dash (--), 23
data
chunking, 245
citing, 458
concealing with masks, 218
converting between formats, 155
CSV format, 152
experimental data, 149
HDF5 format, 153
improving access to, 244, 252, 259
manipulating columns of, 195
missing, 158
mutable vs. immutable types, 65
processing raw, 177
saving/loading in Python, 231
simulation data, 150
size limitations, 259
structuring, 155
time series data, 149
various formats for, 153
wrangling, 155
data analysis and visualization
analysis, 159-162
cleaning and munging, 155-159
importance of, 145
loading data, 151-155
preparing data, 145-151
visualization, 162-175
data frames
benefits of, 263, 269
best application of, 257, 266
creating/working with, 267
handling of missing data, 265
makeup of, 263
series, 264
structure of, 266
vs. other structures, 266
vs. tables, 263
data structures
arrays, 201-227
B-trees, 269
data frames, 263-269
hash tables, 258-263
k-d trees, 272-277
overview of, 257
data-driven analysis, 162
datasets
B-trees and, 271
chunking, 245
compressing, 252
contiguous, 245
HDF5 format, 240
inspecting via command line, 254
inspecting via graphical interfaces, 255
debugging
encountering bugs, 386
functions and, 95
importance of, 386
interactive debugging, 389
linting, 401
overview of, 385
pdb interactive debugger
continuing the execution, 394
features of, 390
getting help, 392
importing, 390
querying variables, 393
running functions and methods, 394
setting breakpoints, 395
setting the state, 393
setting the trace, 391
stepping forward, 392
print statements, 387
profiling, 396-401
reporting bugs with issue trackers, 463
decision making (see flow control and logic)
decorators
class, 139
function, 112-116
def keyword, 96, 123
default constructors, 127
degree of parallelism, 280
del (deletion) operator, 48
deployment
best practices for, 313
challenges of, 309
cloud-based, 325
develop-then-use cycle, 309
documentation, 427-440
goals for, 309
overview of, 329
packaging
binary package managers, 311, 316
Conda package manager, 316
containers, 321-325
cross-platform package managers, 316
distributable files, 311
package managers, 311
pip packaging tool, 312-316
source-based distributions, 311
virtual machines, 319
virtualizations, 311
supercomputer-based, 327
design patterns, 142
develop-then-use cycle, 309
developer guides, 431
dictionaries, 73, 89, 261
digital object identifier (DOI), 458
dimensions, adding to arrays, 214
dir( ) function, 119
directories
bin, 3
changing, 7
current working, 4, 494
deleting, 18
flags and wildcards, 20
home, 5
lib, 3
listing files and subdirectories, 6
making, 18
manipulating, 11-21
printing working, 4
root, 3, 496
searching multiple, 181
trees, 3
discard( ) method, 73
distributed computing, 283
distutils module, 312
Docker, 321-325
docstrings, 98, 122, 435, 494
document processing
common programs, 441, 442
separating content from formatting, 442
tracking changes, 443
WYSIWYG systems, 442
documentation
automated creation of, 436
avoiding over-documentation, 433
benefits of, 429
choices for, 429
comments, 432
docstrings, 435
importance of, 427
naming, 434
readme files, 431
self-documenting code, 434
theory manuals, 430
user/developer guides, 431
value of, 428
dollar sign ($), 1
dot (.), 4, 55, 62, 183
double arrow (˃˃), 25
double dot (..), 62
double minus sign (--), 23
double quote ("), 49, 54
double square brackets ([[]]), 8
double underscore character (__), 120
double-space formatting, 193
Doxygen, 436
DRY (don't repeat yourself) principle, 95, 494
dtypes (data types), 204, 220
duck typing, 65, 133
dunder (see double underscore)
dynamic languages, 39, 44, 494
E
echo program, 31
edge cases, 409
Electronic Frontier Foundation (EFF), 471
element-wise operations, 211
emacs text editor, 15
embarrassingly parallel problems, 282
empty files, 12
empty( ) function, 202, 221
encapsulation, 118, 123
encodings, 50
End-User License Agreement (EULA), 474
env command, 33
environment variables, 32
equality operator (==), 79
equals sign (=), 23, 42
error messages, 43, 84
escape character (\), 13, 54, 184
Evaluated Nuclear Data File (ENDF), 150
exception handling, 82
exceptions, 84, 405, 494
executables, 1, 494
execution pathways, 77
execution stacks (see call stacks)
exit( ) function, 40
experimental data, 149
explicit indexing, 51
explicit relative imports, 62
explicit typing, 134
export command, 32
export control, 484
expressions, 48
extend( ) method, 67
Extended regular expressions (ERE), 182
extension modules, 57
F
"fail early and often" credo, 44
False variable, 45
fancy indexing, 215
figures, publishing in LaTeX, 454
file handle object, 231
file space, 3
filenames, 13
files
accessing remote, 30
appending contents of, 25
benefits of regular expressions for, 177
class files, 446
closing, 232
copying and renaming, 17
creating
choices for, 11
GUIs for, 12
text editors, 13
touch command, 12
creating links in, 455
creating links to, 29
cross-referencing, 29
deleting, 18
finding and replacing patterns in, 190-195
finding filenames with patterns, 182-187
finding patterns in, 188
flags and wildcards, 20
formats for, 230
granting access to, 26
handling in HDF5 format, 239-242
handling in Python, 230-235
hidden, 34
inspecting head and tail, 10
license files, 471
listing, 6
listing with simple patterns, 180
manipulating, 11-21
modes for, 233, 239
operating on multiple, 179-187
overwriting contents of, 25
reading, 233
readme files, 431
redirecting, 15, 25
reducing size of, 252
safety code, 234
setting ownership of, 28
setting permissions on, 29
sharing, 26, 41
style files, 446
user-specific configuration in, 34
version control of
citations, 458
local, 349-369
remote, 371-383
viewing permissions on, 26
writing to, 233
filters, 92, 253
find and replace function, 190
find command, 182-187
fixed points, 107
flags
contextual meanings for, 23
using wildcards with, 20
FLASH license, 481
Flexible Image Transport System (FITS), 150
float( ) function, 238
floating-point arithmetic, 42, 161
FLOPS (floating-point operations per second),
280
flow control and logic
conditionals
if-elif-else statements, 81
if-else statements, 80, 82
syntax of, 77
exceptions, 82
importance of, 77
loops
comprehensions, 90
for loops, 88
formats for, 85
while loops, 86
types of, 77
for loops, 88
forks, 297
format( ) method, 56
forward slash (/), 3
FOSS/FLOSS licenses, 474
fragments command, 35
Free Software Foundation (FSF), 471
from keyword, 62
from-import statement, 58
frozenset data type, 73
full stop (.) (see dot)
functional programming, 109
functions
anonymous, 108
basis of, 95
best practices for, 434
constructors, 127
decorators, 112-116
defining, 96
dunder vs. built-in, 121
generators, 109
in NumPy, 226
keyword arguments, 99
lambdas, 108
modifying, 112
multiple return values, 103
object orientation and, 118
purpose of, 95
recursion, 107
scope, 104
string, 55
universal functions (ufuncs), 223
variable number of arguments, 101
vs. methods, 129
working with, 95
G
general-purpose languages, 39, 201, 494
generators, 109
geometry problems, 277
geophysics and geography projects, 488
Git
additional resources on, 384
checking repo status (git status), 357
common commands, 353
configuring, 354
conflict resolution, 369
creating repositories (git init), 355
discarding revisions (git revert), 364
getting help, 352
installing, 352
merging branches (git merge), 367
saving a snapshot (git commit), 358
staging files (git add), 357
switching branches (git checkout), 366
unstaging (reverting) files (git reset), 363
version control tasks, 355
viewing file differences (git diff), 362
viewing repo history (git log), 361
working with branches (git branch), 365
Git Bash, 1
GitHub
account creation, 372
adding new remotes, 376
checking repo status (git status), 382
conflict resolution, 381
customizable tags in, 465
declaring a remote (git remote), 373
downloading repositories (git clone), 375
features of, 372
fetching and merging (git pull), 380
fetching remote contents (git fetch), 379
forking a repository, 377
merging remote contents (git merge), 380
milestones in, 466
repository creation, 373
repository hosting, 371
sending commits (git push), 374
ticketing system in, 462-468
global interpreter lock (GIL), 291
global scope, 104, 494
global substitution, 191
globals( ) function, 252
glue languages, 39, 494
GNU General Public License (GPL), 477
GNU Portable Threads, 307
Gnuplot, 164
Google Code, 371
Google search, 188
GPUs (graphics cards), 237, 280
gradient( ) function, 212
graphicx, 452
grep, 494
finding patterns in files with, 188
history of, 178
overview of, 187
search options, 189
vs. sed, 195
Grid Engine, 328
group access, 26
H
h5dump command-line tool, 254
h5ls command-line tool, 254
h5py, 230
halting problem, 87
hard disk drives (HDDs), 235
hard links, 29
hash tables, 73
benefits of, 259, 263
best application of, 257, 277
bidirectional/bijective hash maps, 263
collisions, 261
example, 258, 260
inner workings of, 258
popularity of, 258
Python hash( ) function, 73, 258
resizing, 259
hashability, 72
HDF5
benefits of, 229, 235, 237
chunking, 245
compression, 252
converting into NumPy arrays, 153
features of, 238
file manipulation in, 239-242
hierarchical structure of, 238, 242
in- and out-of-core operations, 249
inspecting files
via command line, 254
via graphical interface, 255
memory mapping, 242
querying, 252
tables, 240
utility tools, 254
hdfview graphical tool, 255
head command, 10, 25
help
apropos command, 24
combining simple utilities, 25
man (manual) program, 21
help( ) function, 40
heterogeneous data, 71
heterogeneous problems, 282
Hierarchical Data Format 5 (see HDF5)
high-latency tasks, 292
high-level languages, 39, 95, 312, 494
high-performance computing (HPC), 283
high-throughput computing (HTC), 283, 327
HIPAA (Health Insurance Portability and
Accountability Act), 484
HMTL plots, 172
home directory, 5
homogeneous data, 71
HyperText Markup Language (HTML), 444
hypervisors, 319
I
identity operator (is), 79
if statement, 78
if-elif-else statements, 81
if-else statements, 82
immutable data types, 65, 73
implicit indexing, 51
implicit relative importing, 61
import statement, 58
imprecise types, 42
in-core operations, 249
indentations, 79
indexing
duck typing and, 66
fancy indexing, 215
of data frames, 263
of hash tables, 262
of lists, 67
techniques for, 50
infinite loops, 86
Infrastructure-as-a-Service (IaaS), 325
inheritance, 495
class interface and, 133
definition of term, 118
graphic representation of, 138
in forks, 297
multiple inheritance, 138
polymorphism and, 135
vs. polymorphism, 138
initial state, 127
initialization, 127
Inkscape, 174
installation, 495
instance variables, 126, 131
instantiation, 133
int( ) function, 238
integers, 42
integration tests, 414, 495
integration, continuous, 494
interfaces, 66
interior tests, 409
internal references, 455
interpreted languages, 39, 291, 495
IPython, 40
IPython Notebook, 460
IPython Parallel, 307
is (identity operator), 79
isdigit( ) method, 56
isinstance( ) function, 134
issue trackers, 461, 495
items( ) method, 89
iteration, 86
J
JSON License, 480
K
k-d trees
best application of, 257, 272, 277
documentation on, 277
example, 273
KDTree class, 274
organization of, 273
vs. binary search trees, 272
k-dimensional trees (see k-d trees)
kernel, 319
kernprof, 400
key function, 109
key-value mapping, 258, 272
keys( ) method, 89
keyword arguments, 99
keywords, searching for, 24
ksh (Korn SHell), 495
KVM, 320
L
lambdas, 108
LaTeX
basics of, 335
benefits of, 442
building documents, 447
constituent parts of, 445
document class, 446
document structure, 449
environments, 446
extension packages, 452
internal references, 455
math formulae, 450
metadata, 447
preamble, 446
tables and figures, 454
legacy code, 427
legal issues
Berkeley Software Distribution (BSD), 475
changing licenses, 482
choosing a software license, 474
copyrights, 472
Creative Commons (CC), 478
export control, 484
GNU General Public License (GPL), 477
HIPAA compliance, 484
license files, 471
obtaining council, 471
other licenses, 480
patents, 483
public domain, 473
right of first publication, 473
trademarks, 483
less program, 23
lib directory, 3
licenses (see legal issues)
line numbers, adding, 193
link (ln) command, 29
linking, 455, 495
linspace( ) function, 203
linting, 401
Linux, 495
Linux Containers (LXC), 321
list( ) conversion function, 68
lists, 66-70
literal characters, 179
literal types, 42, 49, 54
literals, escaping, 185
ln (link) command, 29
ln -s command, 29
loadtxt( ) function, 152
local scope, 104, 495
locals( ) function, 252
logic (see flow control and logic)
logical operators, 78
logspace( ) function, 203
loops
comprehensions, 90
customizing with generators, 109
for loops, 88
formats for, 85
infinite, 86
nonterminating, 86
while loops, 86
low-level languages, 66
lower( ) method, 56
ls (list) command, 6, 26, 181
M
magic objects, 120
make utility
automatic updating by, 337
benefits of, 334
enabling, 337
makefile configuration, 343
makefiles, 337
running, 337
special targets, 340
target definition, 338
vs. bash scripts, 336
man (manual) program, 21
markup languages
bibliographies, 456
choices of, 444
LaTeX
building documents, 447
constituent parts of, 445
document class, 446
document structure, 449
environments, 446
extension packages, 452
internal references, 455
math formulae, 450
metadata, 447
preamble, 446
tables and figures, 454
process of using, 444
reference managers, 458
masks, 217
math formulae, typesetting, 450
mathematics projects, 489
matplotlib, 167-172
max( ) function, 101
memory mapping, 242
memory-bound, 249, 495
metacharacters, 495
basic rules for, 179
escaping, 184
globally finding filenames with patterns,
182-187
listing files with simple patterns, 180
special vs. literal meaning of, 185
usefulness of, 179
vs. literal characters, 179
wildcard character, 180
metaclasses, 140
metadata, 495
importance of including, 151
in issue tracking, 465
in LaTeX, 447
reference managers and, 458
updating with touch command, 13
metaprogramming, 139
methods
instance variables and, 131
listing, 119
requirements for, 128
static methods, 132
string methods, 55
vs. functions, 129
Microsoft Word, 443
milestones, 466
Miniconda, installing, xxiii, 317
minus minus sign (--), 23
missing data
dealing with in Pandas, 158
handling of in data frames, 265
mkdir (make directory) command, 18
model-driven analysis, 160
module scope, 104
modules
aliasing imports, 59
aliasing variables on import, 59
basics of, 57
benefits of, 57
distutils module, 312
extension modules, 57
importing, 58
importing variables from, 58
in Python standard library, 62
math module, 225
multiprocessing, 297
packages, 60
pstats module, 396
regular expression module, 177, 197
scipy.constants module, 131
third-party, 63
threading module, 291
modulo operator (%), 53
more program, 23
MPI (Message-Passing Interface)
appropriate use of, 301
basics of, 301
benefits of, 301
role in supercomputing, 300
scalability of, 306
specifications for, 300
mpi4py package, 301
msysGit, xxiv
multidimensional slicing, 209
multiline strings, 55
multiple inheritance, 138
multiprocessing
appropriate use of, 300
benefits of, 296
implementation of, 297
scalability of, 299
vs. threading, 297
multitasking, 296
munging, 155
mutability, 65
mv (move) command, 17
N
N-body problem, 284, 292
N-dimensional array class, 202
NaN (Not a Number) values, 265
nano text editor, 15
natural naming, 240, 434
negative conditionals, 81
negative indices, 51
nesting, 87, 114
newaxis variable, 214
newline character (\n), 54
next function( ), 109
nil value, 45
no-op (no-operation) statements, 495
nodes, 283
non-parallel programs, 285-290
None variable, 45
nonterminating loops, 86
normalization, 91
nose testing framework, 404
notebook web-based browser, 41
NotImplemented, 45
np (see NumPy)
nuclear engineering projects, 488
null( ) function, 96
NULL/null value, 45
numexpr library, 251
NumFOCUS (NF), 471
NumPy, 152
arithmetic, 211
array class in, 202, 301
array creation, 202
attribute modification, 204
benefits of, 201
bitwise operators, 219
broadcasting, 212
documentation on, 203
dtypes (data types), 204
fancy indexing, 215
linspace( ) and logspace( ) functions, 203
masking, 217
miscellaneous functions, 226
ndarray attributes, 203
newaxis variable, 214
np.dot( ) function, 213
np.reshape( ) function, 204
slicing and views in, 208
structured arrays, 220, 267
universal functions, 223
where( ) function, 219
O
object orientation, 495
additional resources, 142
applications for, 117
basics of, 117
benefits of, 126
concept of, 118
design patterns, 142
features of, 118
main ideas of, 118
polymorphism in, 135
reductionism and, 118
objects, 119-123, 495
ones( ) function, 202, 221
online reference manuals, 21
open addressing, 262
open source community, 491
open( ) function, 231
open-source licenses, 474
OpenMP, 307
options, 22
"or" syntax, 186
OSS licenses, 474
out-of-core operations, 249
outer( ) method, 223
output, sending into files, 25
ownership (see legal issues)
P
packages
definition of, 57
installing, xxiv
managing, xxiii, 311
(see also deployment)
Python syntax for, 60
Pandas, 153, 263
parallelism
basics of, 279
benefits and drawbacks of, 279
challenges of, 283
cross-platform, 307
event-driven, 307
libraries for, 291, 297, 302
low-level, 307
maximum degree of, 280
measurement of, 280
multiprocessing, 296-306
N-body example problem, 284
problem classification, 282
scale and scalability, 280
threads, 290-296
vs. sequential programs, 285-290
vs. web-based parallelism, 307
parameterization, 97
parent classes, 137
parentheses operator ( ), 55
particle physics, 123
pass statement, 96
patches, 468
patents, 483
PATH environment variable, 35
path-generating shortcuts, 10
paths, 3, 34
pdb interactive debugger
continuing the execution, 394
features of, 390
getting help, 392
importing, 390
querying variables, 393
running functions and methods, 394
setting breakpoints, 395
setting the state, 393
setting the trace, 391
stepping forward, 392
pedagogy and community projects, 490
PEP8 Style Guide, 120, 434
Perl-compatible regular expressions (PCRE),
182
permissions and sharing
basics of, 26
connecting to other computers, 30
linking files/programs, 29
seeing permissions, 26
setting file ownership, 28
setting permissions, 29
permissive licenses, 477
physics
particle physics, 123
physics-based fields, xviii
projects in, 488
pickling, 301
pip packaging tool, 312-316
pipe command (ǀ), 25, 186
pipeline (see analysis pipeline)
plain text markup, 444
Platform-as-a-Service (PaaS), 325
plotting libraries
Gnuplot, 164
matplotlib, 167-172
point-to-point communication, 302
polymorphism
graphic representation of, 138
inheritance and, 135
multiple inheritance, 138
overview of, 119
subclasses, 136
superclasses, 137
vs. inheritance, 138
positional arguments, 98
positive conditionals, 81
preamble, 446
precise types, 42
precompiled languages, 341
print statements, 387
print working directory (pwd) command, 4
profiling, 396-401
programs
chaining together, 25
creating links to, 29
interrupting, 14
locating built-ins in bash shell, 24
never-terminating, 14
running, 34
writing in parallel, 290-306
writing in serial, 285-290
projects
astronomy and astrophysics, 487
basics steps of, xix-xxi
geophysics, geography, and climate, 488
legal issues surrounding, 471-484
mathematics, 489
nuclear engineering, 488
pedagogy and community, 490
physics, 488
Python open source community, 491
scientific Phyton, 489
prompt, 1, 40
proprietary licenses, 474
pstats module, 396
ptdump command-line tool, 254
public domain, 473
publication
document processing
common programs, 442
overview of, 441
separating content from formatting, 442
tracking changes, 443
WYSIWYG systems, 442
legal issues surrounding, 471-484
markup languages
bibliographies, 456
choices of, 444
LaTeX, 445-456
process of using, 444
reference managers, 458
right of first publication, 473
self-publishing, 473
text editors, 443
pull requests, 468
pwd (print working directory) command, 4
pyflakes linting tool, 401
PyPy project, 291
PyTables
compression routines in, 253
dataset classes in, 238
loading data with, 153
obtaining, 239
querying, 252
vs. h5py, 230
Python
B-tree support libraries, 271
benefits of, 39, 63
built-in data containers, 65-76
comment characters, 41
dictionary resizing, 261
drawbacks of, 39
duck typing and, 65
dunder vs. built-in functions, 121
exiting, 40
expressions, 48
file handling in, 230-235
getting help, 40
hash( ) function in, 73, 258
installing, xxiii
math module, 225
modules
aliasing imports, 59
aliasing variables on import, 59
basics of, 57
importing, 58
importing variables from, 58
in standard library, 62
packages, 60
mutability and, 65
numexpr library, 251
open source community, 491
operators in, 46
pandas package, 263
PEP8 Style Guide, 434
Python prompt, 40
reference counting in, 68
regular expression module in, 177, 197
running, 40
scientific Python, 489
special variables
Boolean values, 45
None, 45
NotImplemented, 45
standard library, 57, 62
statements, 49
strings
basics of, 49
concatenation, 53
indexing, 50
literals, 54
string methods, 55
threading in, 290
variables, 42
whitespace separation in, 79
Python 2, 50
Python 3.3, 50
Python Packaging Authority (PyPA), 312
Python Software Foundation License, 480
Q
querying
HDF5 data format, 252
k-d trees, 275
question mark (?), 183
R
raise keyword, 84
random-access memory (RAM), 235, 280
ranks, 301
read-only mode, 233
read/write permission, 26, 29
readme files, 431
record arrays, 220
recursion, 107
recursive flag (-r), 19
recursive force flag (-rf), 19
redirection, 13, 25, 495
reduce( ) method, 223
reduceat( ) method, 223
reductionism, 118
reference counting, 68
reference managers, 458
reference manuals, 21
references, internal, 455
regex (see regular expressions)
regression tests, 416, 495
regular expressions, 496
applications for, 177, 179
awk
adding action with, 196
benefits of, 195
example of, 195
overview of, 188
vs. sed and grep, 195
benefits of, 177
grep
finding patterns in files with, 188
overview of, 187
search options, 189
history of, 178
matching examples, 194
metacharacters
basic rules for, 179
escaping, 184
globally finding filenames with patterns,
182-187
listing files with simple patterns, 180
special vs. literal meaning of, 185
usefulness of, 179
vs. literal characters, 179
wildcard character, 180
sed
adding double-space formatting, 193
adding line numbers, 193
complex patterns, 192
deleting blank lines, 193
multiple replacement tasks, 191
overview of, 188
saving output to new files, 191
syntax for, 190
vs. grep, 190
text matching with, 178
varied implementations of, 182
relative paths, 4, 34, 496
REPLs (read-eval-print loops), 40, 496
repositories
checking status of, 357
creating, 355, 373
downloading, 375
forking, 377
hosting, 371
sending commits to, 374
reproducibility, 148, 309, 349, 472
resizing hash tables, 259
resources, sharing, 29, 327
return keyword, 97
right of first publication, 473
rm (remove) command, 18
root directory, 3, 496
root users, 496
RunSnakeRun, 397
S
scalability, 281
scale
definition of term, 280
measurement of, 280
strong scaling, 281
weak scaling, 281
scaling up, 281
schedulers, 328
scientific constants, 131
scientific plotting (see plotting libraries)
scientific Python, 489
scipy.constants module, 131
SCons, 343
scope, 104, 496
scp (secure copy) command, 31
scripts, creating, 36
search and replace function, 190
search function, 188
(see also grep)
sed, 496
adding double-space formatting, 193
adding line numbers, 193
complex patterns, 192
deleting blank lines, 193
multiple replacement tasks, 191
overview of, 188
saving output to new files, 191
syntax for, 190
vs. awk, 195
vs. grep, 190
self argument, 128
self-documenting code, 434
self-publishing, 473
separate chaining, 262
sequence, 50, 496
sequential programs, 285-290
series, 263
sets, 71
sh (Bourne SHell), 496
shared resources, 327
shell
basics of, 1
benefits of, 3
changing directories, 7
characteristic of, 2
escape characters in, 184
file inspection (head and tail), 10
home directory (~), 5
listing contents, 6
navigating, 1-11
paths and pwd, 3
types of, 2
shell variables, 32
simulation data, 150
single quote ('), 49, 54
singletons, 44, 496
Slashdot effect, 326
slices, 51, 208
SnakeViz, 399
software (see analysis pipeline; deployment; programs)
Software Freedom Conservancy (SFC), 471
Software-as-a-Service (SaaS), 325
solid state drives (SSDs), 235
sort command, 23
sorted( ) function, 109
source code, 40
source command, 34
source-based package managers, 311
SourceForge, 371
spaces, vs. tabs, 80
spawning, 291, 297
special variables, 44
speedup (s), 281
Sphinx, 436
square brackets ([ ]), 8, 66, 186
SSH (Secure SHell), 30
ssh command, 31
stack backtrace (see traceback report)
stack trace (see traceback report)
"starving CPU" problem, 244
state, 65
statements, 49
static methods, 132
statistical calculations, 263
StopIteration error, 110
strings, 496
basics of, 49
concatenation, 53
converting other types to, 49
defining string literals, 49
escape characters for, 54
indexing, 50
multiline strings, 55
prefixes for, 55
purpose of, 42
string methods, 55
working with string literals, 54
strip( ) method, 56
strong scaling, 281
structured arrays, 220
style files, 446
style guides, 434
subclasses, 136
subdirectories
listing, 6
packages, 60
searching multiple, 181
subpackages, 60
substitution, 190, 191
sum( ) function, 226
superclasses, 137
supercomputers, 300, 310, 327
swapcase( ) method, 56
switches, 22
symbolic links, 29, 496
syntactic style, 434
T
tab character (\t), 54
tables
HDF5 data format, 240
in NumPy, 220
publishing in LaTeX, 454
vs. data frames, 263
tabs, vs. spaces, 80
tcsh (TENEX C SHell), 496
temporary arrays, 211
terminal emulators, 1, 496
ternary operators, 46, 496
test frameworks, 404, 496
test suites, 403
test-driven development (TDD), 419
testing
as core principle, 404
benefits of, 404
concept of, 403
corner cases, 410
edge cases, 409
for equivalence vs. equality, 407
importance of, 404
integration tests, 414
interior tests, 409
placement of, 405
regression tests, 416
running tests, 409
selecting targets for, 406
test coverage, 418
test fixtures, 412
test generators, 417
test matrix, 417
timing of, 405
unit tests, 412
text editors, 13, 15, 443
text matching, 178
(see also regular expressions)
theory manuals, 430
third-party modules, 63
threads
appropriate use of, 290, 292
benefits of, 291
child threads, 291
communication and spawning in, 291
daemon threads, 291
drawbacks of, 291
module for, 291
N-body problem solution using, 292
speed limitations on, 291
vs. multiprocessing, 297
ticketing systems
assigning issues, 466
benefits of, 462
closing issues, 468
discussing issues, 467
issue creation, 464
workflow, 462
tilde character (~), 5
time series data, 149
TORQUE, 328
touch command, 12
traceback report, 395, 496
tracking changes, 443
trademarks, 483
triple double quotes ("""), 55
triple single quotes ('''), 55
True variable, 45
tuples, 70
Twisted, 307
2-body problem, 284
type( ) function, 43
TypeError message, 45
U
unary operators, 46, 496
underscore character (_), 13, 42
Unicode, 49
unit tests, 412, 496
universal functions (ufuncs), 223
Unix, 497
update( ) method, 75
upper( ) method, 56
user access, 26
user guides, 431
UTF-8 encoding, 50
V
ValueError message, 43
values( ) method, 89
values, multiple return, 103
variables
assigning names, 42, 434
Boolean, 45
class-level variables, 124
environment, 32
in Python, 42
instance variables, 126, 131
specifying, 23
types of, 42, 497
variables, in Python, 58
vector graphics, 174
version control, 497
basics of, 349
benefits of, 349
example of, 350
importance to reproducibility, 349
local, 349-369
remote, 371-383
tool types, 351
tools for, 350
with Git, 352-369, 371
view( ) method, 210
views, 209
vim text editor, 15, 190
viral licenses, 478
virtual machines (VM), 319
VirtualBox, 320
virtualization, 310, 321
visualization (see analysis and visualization)
ViTables database viewer, 255
VMware, 320
W
weak scaling, 281
where( ) function, 219, 252
while loops, 86
whitespace separation, 79
whitespace syntax, 80
whitespace, removing, 56
wildcard character (*), 20, 180
write (w) mode, 233
WYSIWYG (What You See Is What You Get), 441
X
XenServer, 320
Y
yes program, 14
yield keyword, 109
Z
zero, 45
zero-indexed languages, 50
ZeroMQ, 307
zeros( ) function, 202, 221
zipping, 252
zlib library, 253
Zope Object Database (ZODB), 271
About the Authors
Anthony Scopatz is a computational physicist and longtime Python developer.
Anthony holds a BS in Physics from UC Santa Barbara and a Ph.D. in Mechanical/
Nuclear Engineering from UT Austin. A former Enthought employee, he did his
postdoctoral studies at the Flash Center at the University of Chicago, in the Astrophysics Department. He is currently a staff scientist in the Department of Engineering
Physics at the University of Wisconsin–Madison. Anthony’s research interests revolve
around essential physics modeling of the nuclear fuel cycle, and information theory
and entropy. Anthony is proudly a fellow of the Python Software Foundation and has
published and spoken at numerous conferences on a variety of science and software
development topics.
Kathryn Huff is a fellow with the Berkeley Institute for Data Science and a postdoctoral scholar with the Nuclear Science and Security Consortium at the University of
California Berkeley. In 2013, she received her Ph.D. in Nuclear Engineering from the
University of Wisconsin–Madison. She also holds a BS in Physics from the University
of Chicago. She has participated in varied research in areas including experimental
cosmological astrophysics, experimental non-equilibrium granular material phase
dynamics, computational nuclear fuel cycle analysis, and computational reactor accident neutronics. At Wisconsin, she was a founder of scientific computing group The
Hacker Within, and she has been an instructor for Software Carpentry since 2011.
Among other professional services, she is currently a division officer in the American
Nuclear Society and has served two consecutive years as the Technical Program Co-Chair of the Scientific Computing with Python (SciPy) conference.
Colophon
The animal on the cover of Effective Computation in Physics is a bobtail squid (of the
order Sepiolida). Bobtail squids are part of a group of cephalopods that are closely
related to cuttlefish, but do not have a cuttlebone. They have eight arms and two tentacles and are generally quite small (usually between 1 and 8 centimeters).
Bobtail squid can be found in the shallow coastal waters of the Pacific Ocean as well
as in some areas of the Indian Ocean and off the Cape Peninsula of South Africa. In
some parts of the world, they are known as the “dumpling squid” or “stubby squid”
because of their rounded bodies. Like cuttlefish, they can swim either by using the
fins on the outside of their bodies or by using jet propulsion.
Because they live in shallow waters, the bobtail squid has developed a symbiotic relationship with the bioluminescent bacteria Vibrio fischeri, which provide camouflage
in return for food and habitat. The bacteria live in a special organ inside the squid’s
mantle (body cavity) and are fed a sugar and amino acid mixture. The bacteria then
emit enough light to match what hits the top of the squid’s body, hiding its silhouette
from predators swimming below. This symbiosis begins almost immediately after the
squid hatches and even induces the morphological changes that lead to maturity.
About 70 species of bobtail squid are known, but taxonomy within the Cephalopoda
class is controversial. The number could change in the future as more species and
evolutionary evidence are discovered.
Many of the animals on O’Reilly covers are endangered; all of them are important to
the world. To learn more about how you can help, go to animals.oreilly.com.
The cover image is from a loose plate of unknown origin. The cover fonts are URW
Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font
is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.