Stata Guide 2012

Page Count: 154

Statistics with Stata
Student guide
Version 0.9.8.4, by François Briatte

Contents

Introduction

1. Basics
2. Computers
3. Stata
4. Research

Data
5. Structure
6. Exploration
7. Datasets
8. Variables

Analysis
9. Distributions
10. Association
11. Regression
12. Cheat sheet

Projects
13. Formatting
14. Assignment No. 1
15. Assignment No. 2
16. Final paper

Acknowledgements go first and foremost to Ivaylo Petev, with whom I co-taught three of the
five courses for which this guide was written. I also benefitted from a lot of friendly advice from
Baptiste Coulmont, Emiliano Grossman, Sarah McLaughlin, Vincent Tiberj and Hyungsoo Woo.
All mistakes and omissions, as well as the views expressed, are mine and mine alone.


Introduction
This guide was written for a set of five quasi-identical postgraduate courses run at Sciences Po in Paris from Fall 2010 to Spring 2012. The full course material is available online.
The course is organised around three learning objectives:
First, it introduces some essential aspects of statistics, ranging from describing variables to running a multiple regression. The course requires reading statistical theory applied to social surveys in preparation for each course session.
Second, it introduces how to operate those procedures with Stata, a statistical software
that we will practice during class. The course also requires that you practice using Stata
outside class in order to become sufficiently familiar with it.
Third, the course will lead you to develop a small research project, on which your
grade for the course will be based. The course therefore requires regular attendance
and homework, which will lead to writing up that research project.
This guide covers the following topics:
− The course basics, Stata fundamentals and essential computer skills
− Basic operations in data preparation and management (Part 1)
− Introductory quantitative methods and statistical analysis (Part 2)
− Instructions for the assignments and the final paper (Part 3)


1. Basics
Quantitative methods designate a specific branch of social science methodology, within
which statistical procedures are applied to quantitative data in order to produce interpretations of complex, recurrent phenomena.
Just as in other domains of scientific inquiry, the complexity and precision of statistical procedures are necessary to the study of large-scale phenomena by social scientists. Recent examples of such topics include the evaluation of a program aimed at developing fertilizer use in Kenya (Duflo, Kremer and Robinson, NBER Working Paper, 2009), an explanation of attitudes towards highly skilled and low-skilled immigration in the United States (Hainmueller and Hiscox, American Political Science Review, 2010), and a retrospective electoral analysis of the vote that put Adolf Hitler into power in interwar Germany (King et al., Journal of Economic History, 2008).
Quantitative methods courses come with a particular set of principles, which might be
arbitrarily summarized as such:
− Researchers learn and share their knowledge of quantitative methods with the largest possible audience, and to the best of their abilities.
− Quantitative data are shared publicly, along with all necessary resources to replicate their analysis (such as do-files when using Stata).
On the learning side, some very simple principles apply:
− Quantitative methods are accessible to everyone interested in learning how to use them. Curiosity comes first.
− There is no substitute for reading, practicing and looking for help, from all kinds of sources. Reading comes first.
− Making mistakes, correcting one’s own errors and hitting one’s own limits are intrinsic to learning. Trial and error comes first.

Statistical reasoning and quantitative methods are intellectually challenging for teachers
and students alike, and a collective effort is required for the course to work out:
− You will have to attend all course sessions: your instructors expect systematic attendance; catch up with any sessions that you have missed.
− When attending classes, attend classes: your instructors will feel completely useless if you do anything else, like reading your email or browsing the web.
− Assignments will be graded in order to monitor your progress: no assignments, no grade.
− Read all course material: read everything you are told to if you want to understand what you are learning in this course.
These, of course, are similar to the course requirements for your other classes.

1.1. Homework

Apart from attending the weekly two-hour course sessions, you are required to:
− Complete readings from the handbook and other material, as indicated in the course syllabus. This will take you approximately one hour per week, perhaps two if statistics are completely new to you.
− Replicate course sessions outside class, using the do-files provided on the course website. This will take you between half an hour and one hour per week.
− Work on your research project and assignments, using the instructions provided during class and in the course documentation. Your project will require between one and a half and three hours of work per week, depending on your learning curve and on the project itself.

In total, the time of study for this course amounts to two hours of class plus between three and six hours of homework per week. The workload is variable, but a fair estimate is that you will spend between five and eight hours per week studying for this course.
It is important to state right from the start that it is not possible to follow the course irregularly, whether by skipping weeks and trying to catch up later or by allocating long periods of last-minute work before deadlines. Experience shows that these strategies systematically lead to low achievement and low grades.

1.2. Assignments

The course was conceived with a hands-on focus. This means that you are not expected to take lecture notes and then revise for a final graded examination. Instead, you will develop a research project throughout the course.
Your work will be corrected, commented and assessed twice during the course, through assignments that will help monitor your progress. The final version of your project will make up the largest part of your grade.
In practice, your assignments are ‘open assignments’ that you can complete with all the resources you need at hand (notes, guides, online help). This cancels out a strategic skill often observed among students: memorising large amounts of information for just one occasional exam. Memorising will not work at all for this course. Instead, you will have to learn and practice regularly throughout the semester. If you are not used to that method of study, then the course will be one critical opportunity to learn it.

The assignments are also cumulative: Assignment No. 1 will be revised and included in Assignment No. 2, just as your final paper will draw extensively on the revised versions of both assignments. To learn more about how to complete assignments, read the instructions provided in Part 3 of this guide.

1.3. Communication

Let’s immediately clarify two things about student-instructor communication for this
particular course:
− Email will be used for feedback and all correspondence. To simplify this process, we will use normalized email subjects.
A normalized email subject looks like “SRQM: Assignment No. 1, Briatte and Petev”, where “SRQM” is an acronym for the course, and “Briatte and Petev” are the family names for you and your study partner.
Normalized email subjects apply to all correspondence, not just when sending assignments. To ask a question on recoding, you should hence use “SRQM: Question on recoding”.
When working in pairs, always copy your partner when sending emails and assignments. Likewise, send all your emails to both course instructors if you want a reply (and especially a quick one).
− You should ask questions in class and email additional ones. Asking questions implies that you take ownership of the course material, rather than passively absorbing it as lecture notes.
You should not feel uncomfortable asking questions in class. Neither should you expect others to ask questions for you. Unfortunately, you might have survived several courses doing precisely that until now, and might survive more in the future doing the same. This course, however, runs on personalised projects that do not allow exit or free riding.
The extra effort that is required from you on that side is the counterpart to offering a course where you learn through practice rather than by rehashing a handbook into a standardized examination or abstract problem sets with no empirical counterpart.

Normalized email subjects help the instructors in sorting out large volumes of email. There are no direct sanctions for not following this principle, but there are indirect costs, especially if your assignment emails get lost in the instructors’ grading pile or if you end up waiting three weeks to send a question that will set your project back.


1.4. Research project

The course is built around your elaboration of a small-scale research project. Because
the course is introductory by nature, several limits apply:
− You are required to use pre-existing data, instead of assembling your own data and building your own dataset, which is a much longer process that requires additional skills in data manipulation.
− You are required to use cross-sectional data, because time series, panel and longitudinal data require more complex analytical procedures that are not covered in introductory courses.
− You are required to use continuous data, because discrete variables also use different techniques not covered at length in the course. This applies principally to your choice of dependent variable.
These requirements and their terminology are covered in Section 5. For now, just remember this basic principle: your research will be based on one dependent variable, sometimes called the ‘response’ variable, and you will try to explain this variable by predicting the different values that it takes in your sample (dataset) using several independent variables, sometimes called ‘explanatory’ variables, ‘predictors’, or ‘regressors’ in technical papers.
Because your research is a personal project, you might bend the above rules to some extent if you quickly show the instructors that you can handle additional work with data management. The following advice might then apply:
− If you are assembling your data by merging data from several datasets, you will be using the merge command in Stata (Section 5.2). You might also choose to use Microsoft Excel for quicker data manipulation. Do not assemble data if you do not already have some experience in that domain.
− If you are converting your data, refer to the course and online documentation on how to import CSV data into Stata using the insheet command, or how to convert file formats like SAV files for SPSS. Always perform extensive checks to make sure that your data were properly converted into a readable, valid file.
− If you are interested in temporal comparison, such as economic performance before and after EU accession, you can compute a variable that will capture, for example, the change in average disposable income over ten years. Stick with a simple measure of change.
− If you have selected nominal data as your dependent variable, such as religious denomination, then something went wrong in your research design—unless you know about multinomial logit, in which case you should skip this class. Please identify a different variable that is either continuous or ‘pseudo-continuous’, i.e. a categorical variable with an ordinal (or better, interval) scale, such as educational attainment.
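The data-management advice above can be sketched in a few Stata commands. This is only an illustrative sketch: every filename and variable name below is hypothetical.

```stata
* Import a CSV file into Stata (hypothetical filename).
insheet using "datasets/mydata.csv", clear

* Merge with a second dataset on a shared identifier (see Section 5.2).
merge 1:1 country using "datasets/other.dta"

* Compute a simple measure of temporal change, such as a ten-year
* difference in average disposable income (hypothetical variables).
gen income_change = income2010 - income2000
```

Check the result of each step before moving on: for instance, inspect the _merge variable created by merge before dropping any unmatched observations.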

1.5. Guidance

This guide works only if you use it. Its writing actually started with students’ questions: several sections were first written as short tutorials on specific issues with data management. One thing led to another, and we ended up with the current document. The aim is to cover 99% of the course by version 1.0. A handful of students also provided valuable feedback on the text—thanks!
The Stata Guide is a take on the course requirements: it offers a narrative for the commands used, tying them together with core elements of statistical reasoning and quantitative methods. The guide aims to cover the most useful introductory concepts of statistics for the social sciences, and to offer a detailed exploration of these concepts with Stata.
The term ‘introductory’ is important: the course focuses on selected commands and options to work through selected operations of ‘frequentist’ statistics. To fully support these learning steps, it also introduces basic computer and research skills that are often missing from the training of students, and offers a way to assess all aspects of the course through a small-scale research project.
Several sections of the guide are still in draft form, so watch for updates and read it alongside other documentation. As explained in Section 2.5, there is a wealth of documentation out there, and you should be able to locate help files to complement the Stata Guide on how to use Stata to analyse your data.


2. Computers
A course on quantitative methods is bound to make intensive use of computer software. You all use computers routinely for many different activities, but your level of familiarity with some of the fundamental aspects of computers can vary dramatically. Being reasonably familiar with computers is required for this class.
Please read this section in full and assess whether you are familiar enough with the notions covered; otherwise, start practicing as soon as possible. A reasonable level of familiarity with computers will help you use Stata and complete assignments, and will also generally come in handy.

2.1. Basics

The course requires minimal computer skills. In order to open and save files in Stata,
you should be able to:
− Locate files using their file path. In recent, common operating systems, a file path looks like /Users/fr/Courses/SRQM/Datasets/qog2011.dta in Mac OS X. Get used to file paths if you have never used them before.
− Locate online resources using their URL. The URL for the course website is http://f.briatte.org/teaching/quanti/. We will use URLs extensively when guiding you through coursework and course material.
− Understand file and memory size, which is often displayed in megabytes (MB). Using Stata 11 or below correctly requires setting memory to load large files: for instance, set mem 500m sets memory to 500MB.

2.2. Filenames

Filenames are another essential aspect of computer use, especially when you are handling a large number of files and/or using multiple copies of the same file. Some general recommendations apply:
− In all cases, filenames should be short and informative. Regularly accessed files, like datasets, have short filenames for faster manipulation, and contain the time period covered by the data.
− In some cases, filenames require normalization. This implies using sensible filenames and standard version numbers for files that are chronologically ordered. This point is important because you will be required to normalize the filenames for your work files in this class (Part 3).
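As an illustration, a chronologically ordered series of normalized work files might look like the list below. The names are hypothetical; Part 3 of this guide specifies the exact convention required for this course.

```
briatte_petev_assignment1_v1.do
briatte_petev_assignment1_v2.do
briatte_petev_paper_v1.do
qog2011.dta
```

Version numbers make the latest file easy to identify at a glance, while the dataset keeps a short, stable name that records the time period covered by the data.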

2.3. Equipment

Regarding computer equipment, you will need:
− Access to a computer, both at university and at home. You should bring your personal laptop to class if you own one. Make sure that you know how to work with your computer and that it is fast enough.
− A university email account subscribed to the course, and possibly a personal email account to share larger files and to back up your work. The standard solution for an efficient work mailbox is Gmail.
− Access to the ENTG, as provided by Sciences Po. The course will use the “Documents” pane to share the course emails and readings. Other files will be available from the course website.
− A word processor to type up your final paper. Despite being a worldwide standard, Microsoft Word is unstable: always back up your work. Any solution is good as long as it can be printed to a PDF file.
− A working copy of Stata (our software of choice, introduced in Section 3) on the computer(s) used during the course and at home. This point will be discussed during class. Stata includes a plain text programming editor.
− A USB stick, to build a course ‘Teaching Pack’ by saving and organizing all course material, as well as the files from your research project. Always make regular backups of your data in at least two different locations.
Some of these items will be provided to you through Sciences Po. Please make sure that you have equipped yourself as early as possible in the semester, and, as indicated several times in the list above, always back up your work!

2.4.

The course will regularly require that you locate and download resources online, from datasets to do-files, as well as other course material, mostly in PDF format or as ZIP archives. Make sure that you know how to handle these formats. If your browser adds a “.txt” extension to your do-files, you will need to rename the file by turning its file extension back to just “.do” to open it in Stata.
Google Chrome, Mozilla Firefox and Apple Safari are common Internet browsers with appropriate “Save As” options available from their contextual menus. The example below shows the contextual menu for Google Chrome on Mac OS X.


All course material should be archived into a structured folder hierarchy, which will
depend on your own preferences and operating system. A simple hierarchy, such as
~/Documents/SRQM/ on Mac OS X, will let you access all files quickly.

2.5. Help

Quantitative methods cannot be learnt once and for all: the course will require that
you frequently search for help, often from online sources. Always consult the course
material for each session before seeking additional help: the answer is very often just
before your eyes or a few clicks away.
If you are looking for help on a Stata command, use the help command to access the
very large internal documentation included in Stata. Even experienced users use help
pages on a daily basis. Learning to use Stata help pages is a course objective in itself.
Stata help command: http://www.stata.com/help.cgi?help
If you are looking for help on statistics, please first refer to the course readings listed in the course syllabus. Feinstein and Thomas’ Making History Count (Cambridge University Press, 2002) is the main handbook for this course; help on graphics and other topics is covered in the other course readings.
If you are looking for help on statistical procedures in Stata, please first refer to the course website for a selection of Stata tutorials. Two American universities, Princeton and UCLA, have produced excellent Stata tutorials that cover similar material to the course sessions. More tutorials are available online.
Course website: http://f.briatte.org/teaching/quanti/
If you are stuck, do not panic! Please first make sure that you have explored the software and course resources listed above. It is safe to assert that 99% of Stata questions for this course can be answered from the course material. If still stuck, try a Google search on your question: thousands of online sources hold answers to identical questions asked by Stata users around the world. Researchers often check the Statalist and the statistics section of the StackOverflow website for answers to their own questions.
Finally, if still stuck, and in this case only, email us to ask your questions directly. Email correspondence should preferably be limited to questions on your research design, rather than questions that could be answered by simply reading the course material mentioned above.


2.6. Commands

Stata can be used either through its Graphical User Interface (GUI), like most software, or through a ‘command line’ terminal, which is a very common aspect of programming environments. As explained just below, learning how to use the command
line and writing do-files are compulsory for replication purposes.
The ‘command line’ terminal works by entering lines of instructions that are reproduced, along with their results, in another window. The next sections of this guide explain further how Stata works and document the usual commands used in this course
for data management, description, analysis and graphing. A ‘cheat sheet’ for these
commands is offered in Section 12.
The screenshot below shows an example of such commands, typed manually in the Command window (top); after running them by pressing Enter, their output is shown in the Results window (bottom).

Command line terminals work by entering commands, such as set mem 500m (which assigns 500MB of computer memory in Stata 11 and earlier). When you press Enter, Stata will try to execute, or ‘run’, the command, which might occasionally take a little time if your data or command are computationally intensive.
If your command ran successfully, Stata will display its result:
. set mem 500m

Current memory allocation

                    current                                 memory usage
    settable          value     description               (1M = 1024k)
    --------------------------------------------------------------------
    set maxvar         5000     max. variables allowed          2.105M
    set memory         500M     max. data space               500.000M
    set matsize         400     max. RHS vars in models         1.254M
    --------------------------------------------------------------------
                                                              503.359M

Note that some commands produce ‘blank’ output in the Results window, i.e. the command was successfully entered and executed, but there is no indication of its actual result(s). In these cases, a simple dot (‘.’) will appear on its own line in the Results window, to show that Stata encountered no problem while executing the command and is ready for the next one.
If the command is not valid, which often happens due to typing errors or for other reasons related to how Stata works, Stata will display an error or a warning. In that case, you have to fix the issue, often by re-typing the command correctly (as in the ‘summarizze’ example below), or by checking the documentation to understand where you made a mistake and how to fix it.
. summarizze age
unrecognized command:  summarizze
r(199);

. summarize age

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |     24291    46.81392    17.16638         18         84

Stata commands are case-sensitive: use only lowercase letters when typing commands. Variable names can contain both uppercase and lowercase letters, and you will have to type them exactly to avoid errors. As a rule of thumb, when creating variables, use only lowercase letters.
If you need to correct an invalid command or re-run a command that you have already
used earlier on, you can use the PageUp or Fn-UpArrow keys on your keyboard to
browse through the previous commands that you typed, which are also displayed in the
Review window.

Some commands can be abbreviated for quicker use. If you run the help summarize
command in Stata, the help window will tell you that the summarize command can be
shorthanded as su:
. su age

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |     24291    46.81392    17.16638         18         84

Abbreviations exist for most commands and come in handy especially with commands
such as tabulate (shorthand tab), describe (shorthand d) or even help (shorthand h).
They also work for options like the detail option for the summarize command:
. su age, d

                             Age
-------------------------------------------------------------
      Percentiles      Smallest
 1%           18            18
 5%           21            18
10%           24            18       Obs                24291
25%           32            18       Sum of Wgt.        24291

50%           46                     Mean            46.81392
                        Largest      Std. Dev.       17.16638
75%           60            84
90%           71            84       Variance        294.6846
95%           77            84       Skewness         .234832
99%           83            84       Kurtosis         2.09877

Some commands have particular attributes. Comments, for example, are lines of explanation that start with * or //. They are not executed, but they are necessary to make your do-files and logs understandable by others as well as by yourself. The first and third lines in the example below are comments.
. * Creating a variable for Body Mass Index (BMI).
. gen bmi = weight*703/height^2

. * Summary statistics for BMI.
. tabstat bmi, s(n mean sd min median max)

    variable |       N      mean        sd       min       p50       max
-------------+-----------------------------------------------------------
         bmi |   24291     27.27  5.134197  15.20329  26.57845  50.48837

You will find many comments in the course do-files: use them to describe what you are
doing as thoroughly as necessary. You will be the first beneficiary of these comments
when you reopen your own code after some time.
Additional commands can be installed. Stata can ‘learn’ to ‘understand’ new commands through packages written by its users, most often academics with programming
skills. We will use the ssc install command at a few points in this guide to install some
of these packages. Installation with ssc install requires an Internet connection.
Right away, you should install the fre command in Stata by typing ssc install fre, as we will use this command a lot to display frequencies. Other handy commands like catplot, spineplot or tabout will be installed throughout the course. The example below shows the output when the package is already installed; on a first installation, Stata downloads the files and reports ‘installation complete.’ instead:

. ssc install fre
checking fre consistency and verifying not already installed...
all files already exist and are up to date.

2.7. Replication

In this guide, terms like commands, logs and do-files collectively designate an essential
aspect of quantitative methods: replication, i.e. providing others as well as yourself with
the means to replicate your analysis.
Replication requires that you keep your original files intact. The dataset that you will
use for your research project should be left unmodified in your course folder, and
should be provided along with your other files when handing in assignments.
Replication also requires the list of commands you used to edit your data, for instance
to drop observations or to recode variables, as well as the commands that you used to
analyse the data, such as tabstat, histogram and regress. The commands can all be
stored into a single text file, with one command appearing on each line: this structure is
common to computer scripts and programs.
A do-file is a text file that contains your commands and comments. A log file is a separate text file that contains these commands along with their results. The production of do-files will be practiced in class, and additional documentation appears in many of the Stata tutorials listed in the course material.
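For illustration, a minimal do-file might read as the sketch below, which reuses the dataset and variable from the log example in Section 3.5 (treat the filenames as hypothetical):

```stata
* example.do -- a minimal replication script.

* Open a log file to record all commands and results.
log using example.log, replace

* Load the data, leaving the original file untouched.
use datasets/qog2011.dta, clear

* Count countries with 0% malaria risk in 1994.
count if sa_mr==0

* Close the log.
log close
```

Running the file from top to bottom should reproduce the analysis without errors, which is the basic test of replicability.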
Replication files are a crucial aspect of programming. If you open the do-files for this course in the Stata do-file editor, or in any other Stata-capable editor, you will notice that the files feature line numbers and coloured syntax. These generic features are built into most programming environments.
Learning to understand and write in programming languages takes time, and therefore constitutes a particular skill. Writing do-files in Stata requires learning commands and their syntax, exactly as natural languages require learning vocabulary and grammar. And just as with languages, the learning curve flattens once you already know one. Finally, since programming can also reflect high or low writing skills, you should read the coding recommendations in Section 13.3 before submitting your own work.


3. Stata
Our course uses a recent version of Stata, a common software choice in social science
disciplines. Using any statistical software requires some basic skills in file management
and programming. The following steps apply to virtually any Stata user, and should be
practised until you are familiar enough with them.
Once you start exploring the Stata interface, you will realize that most windows can be
hidden to concentrate on your commands, do-files and results, as below. We won’t use
any other element of the interface in this course, but feel free to explore Stata and use
other functionalities.

3.1. Command line

Stata has a graphical user interface (GUI) and a command line system. The latter is much more versatile and teaches you the syntax used by Stata. More importantly, the command line forces you to plan your data management and analysis.
Because the commands entered through the command line can be recorded, i.e. stored as logs (see below), the command line lets you maintain a record of your operations and store comments alongside them. This is essential to stay on top of your own work, as well as to share it with others, usually as do-files.
The Stata GUI can be used occasionally for routine operations that need not appear in your do-files. Keyboard shortcuts also save some time, as with ‘File > Open…’ (Ctrl-O in Stata for Windows) or ‘File > Change Working Directory…’ (Cmmd-Shift-J in Stata for Macintosh; this document uses ‘Cmmd’ to designate the ‘⌘’, a.k.a. ‘Command’, ‘Cmd’ or ‘Apple’ modifier key).


3.2. Memory

Stata 12 manages memory on its own, but older versions of Stata usually open with a very small memory allocation for data. To safely open large files in Stata 11 or below, we recommend running the set mem 500m command to allocate 500MB.
Very large datasets might require allocating more memory, using a different version of
Stata with enhanced capacities, or even switching from Stata to software with higher
computational power. This course will not require doing so.

3.3. Working directory

The working directory is the folder from which Stata will open and save files by default. You will have to set the working directory every time you launch Stata.
To learn what the current working directory is, use the pwd command. To set it to a new location, type cd followed by the path to the desired folder. To list the contents of the working directory, use the ls or dir command.
Your working directory should be the main ‘SRQM’ folder for this course, which we
also call the ‘Teaching Pack’ because you will be required to download all course material to it. Download the Teaching Pack from the course website and unzip it to an easily
accessible location, such as your Documents folder.
The example below reflects all directory commands for a user called ‘fr’ using Mac OS X to change the Stata working directory from the user’s Documents folder to the SRQM folder, and then to list its contents:

. pwd
/Users/fr/Documents

. cd "~/Documents/Teaching/SRQM/"
/Users/fr/Documents/Teaching/SRQM

. ls, w
Course/        Datasets/      Replication/   Software/
emails.txt     website.url*

The quotes around the file path are optional in this example, but are compulsory if your
file path contains spaces. The ls command above was given the wide (shorthand w)
option to make its output simpler to understand.
If you are unsure what the path to your SRQM folder is, do not just ignore this step as
if it were optional. Select ‘File > Change Working Directory…’ in the Stata menus, and
from there, select your SRQM folder.


3.4. Open/Save

Stata can use the usual open/save routine that you are familiar with from using other
software. It can also open datasets and save them from the command line if you have
correctly set your working directory in the first place.
The example below shows how to download a Stata dataset from an online source and
then save it on disk. The use command with the clear option removes any previously
opened dataset from memory, and the save command with the replace option will
overwrite any pre-existing data:
. use http://f.briatte.org/teaching/quanti/data/trust.dta, clear
. save datasets/trust.dta, replace
file datasets/trust.dta saved

In this course, you will never have to save any data: instead, you should leave all datasets intact and use do-files to transform them appropriately. This will ensure that your
work stays entirely replicable.

3.5. Log files

The log is a text file that, once opened with log using, will save every single command you enter in Stata, as well as its results. Systematically logging your work is good practice, even when you are just trying out a few things. Logs can be closed with the log close command, followed by the name of your log if it has one:
. log using example.log, name(example) replace
      name:  example
       log:  /Users/fr/Documents/Teaching/SRQM/example.log
  log type:  text
 opened on:  17 Feb 2012, 00:01:42

. use datasets/qog2011, clear
. * Count countries with 0% malaria risk in 1994.
. count if sa_mr==0
  76
. log close example
      name:  example
       log:  /Users/fr/Documents/Teaching/SRQM/example.log
  log type:  text
 closed on:  17 Feb 2012, 00:02:53

Comments will also be saved to the log file, which is particularly useful when you have
to read through your work again or share it with someone. In the example above, all
comments, commands and results were saved to the log file.

3.6. Do-files

Logs are useful to save every operation and result from a practice session. If you need
someone else to replicate your work, however, you just need to share the commands
you entered, along with the comments that you wrote to document your analysis. Files
that contain commands and comments are called do-files.
Writing do-files is a crucial aspect of this course. Without a do-file, your work will be largely incomprehensible to others, or at the very least impossible to reproduce. Your do-file should include your comments, and it should run smoothly, without returning any errors. You will discover that these steps require a lot of work, so start programming early.
To open a new do-file, use either the doedit command or the ‘File > New Do-file’ menu (keyboard shortcut: Ctrl-N on Windows, Cmd-N on Macintosh).
You should take inspiration from the do-files produced for the course to write up your
own do-file for your research project. All our do-files are available from the course website. This course requires only basic programming skills, as illustrated by the do-files that
we run during our practice sessions. More sophisticated examples can be found online.
To execute (or ‘run’) a do-file, open it, select any number of lines, and press Ctrl-D in Stata for Windows or Cmd-Shift-D in Stata for Macintosh. You can also use the GUI icons at the top-right of the Do-file Editor window, or the do or run commands. Use the Ctrl-L (Windows) or Cmd-L (Macintosh) keyboard shortcut to select the entire current line in order to run it.
Get some practice with do-files as soon as possible, since your coursework will include
replicating one do-file a week. Replicating is nothing more than reading through the
comments of a do-file, while running all its commands sequentially.
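As an illustration, here is a minimal sketch of a do-file, reusing the log example from Section 3.5; the dataset and variable are the ones shown earlier in this guide:

* example.do -- a minimal do-file (sketch based on earlier examples).
* Open a named log to save all commands and results.
log using example.log, name(example) replace
* Load the dataset, clearing any data previously in memory.
use datasets/qog2011, clear
* Count countries with 0% malaria risk in 1994.
count if sa_mr==0
* Close the log.
log close example

Running this file from top to bottom should produce the same log output as shown in Section 3.5.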

3.7. Shutdown

When you are done with your work, just quit Stata like you would quit any other program. At that stage, any unsaved operation will be lost, so make sure that your do-file
contains all the commands that you might want to replicate.
To quit with the command line, use log close _all to tell Stata to close all logs, and then
type exit, clear to erase any data stored in Stata memory and quit. Alternatively, just
exit Stata like any other program to close logs and clear data automatically.
Remember not to save your data on exit (Section 3.4).
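Typed at the command line, the full shutdown sequence described above is therefore:

. log close _all
. exit, clear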


3.8. Alternatives

This course uses Stata (by StataCorp) as its statistical software of choice. Stata is commonly used by social scientists working with quantitative data in areas such as economics and political science. It is a powerful solution that provides a good middle ground between spreadsheet editors and R, which is the most powerful and least expensive option (it is free and open source), but also the most difficult statistical software to learn.
Stata is also more advanced than SPSS because of its emphasis on programming, which has led to the development of a large set of additional packages. Most statistical procedures have some form of implementation in Stata, and the software is supported by a large user community that meets on the Statalist mailing list.
Stata has a few limitations. Its graphics engine is not bad, but not excellent either. It is
not as capable as SAS with large datasets, nor as focused on a particular approach to
quantitative analysis as EViews for econometrics. Finally, unlike free and open source
software, it is a commercial product.
Within these limitations, Stata remains an appropriate solution for the kind of procedures that you will learn to use during this course. Its programming features, operated
through the command line, are central to the learning objectives of the course.
The Stata website will tell you more about the different versions of Stata. It also holds
an online version of the software documentation: http://www.stata.com/. The website
also links to Stata books, journals, and to the Statalist mailing list.
If you are planning to continue using quantitative methods during your degree, you
should also start learning more about R as soon as you are familiar with Stata. Alternatives to Stata are documented in the course material.


4. Research
The course is built around small research projects, on which you will write your final
paper. Every student (grouped in pairs when applicable) is expected to participate,
which requires some basic knowledge of scientific reasoning.
Scientific research aims at establishing theories of particular knowledge items, such as
elections (political science), continents (geography), international trade (economics),
history, proteins, galaxies and so on. All these items are grounded in real events that
are partially processed through theoretical models of what they represent: competitions
of political elites within the structural constraints of partisan realignments, drifting tectonic plates on top of the lithosphere, markets dominated by agents interested in macroeconomic performance, representations of particular historical events, biological compounds of amino acids, gravitational systems of stars… Our collective knowledge of
reality is directly mediated by these abstract conceptions.
Quantitative social science explores particular phenomena, usually of large scale, in order to produce complex explanatory models that follow the same set of rules as those cited above. Specifically, it looks for the regularities and mechanisms that intervene in the distribution of social events such as military conflict, economic development or democratic transitions, all of which tend to happen under particular conditions at different points in space and time. The aims of quantitative social science consist in building theories that simplify these conditions by pointing at the specific variables that might intervene in causing the events under scrutiny.
The final model used in this course, linear regression, offers one possible way of identifying these variables, by looking at how a set of independent (explanatory) variables can predict a fraction of the variation in a dependent (explained) variable. Can we understand, for example, the spread of tuberculosis in a country by looking only at the different levels of sanitation in a sample of the world? Is it the case that support for violent action decreases with age and education? Are states more likely to be concerned with environmental issues when they possess a high level of national wealth? Or is it rather the case that their attention varies as a function of their own exposure to, for instance, natural disasters?
Thousands of researchers spend their whole lives on similar questions, and millions of theories exist on all aspects of the real (natural, material) world.

4.1. Comparison

A fundamental motive behind theory building lies in comparison. Our units of observation, such as individuals or countries, express different characteristics that can be compared with each other. Nation states, for instance, express various levels of authority over their citizens, to the point where we can (or at least wish to) distinguish some political systems as democracies—structures of authority that are ultimately controlled by
citizens through means such as open elections. Similarly, some nation states go through periods of acute political disruption that lead to social revolutions. Even more fundamentally, some nation states hardly qualify for that title: the extent to which states and nations coincide also varies from one country to another. These questions are fundamental issues in comparative politics (the selection of issues above comes from a course by David Laitin at Stanford University). Similar research questions structure all other fields of social science, from economic history to analytical sociology.
To understand the variety of political configurations in (geographical) space and in (historical) time, social science researchers formulate arguments in which they posit explanatory factors, which we will call independent variables. Continuing with the examples
above, an early explanation of democracy is Montesquieu’s theory that climate influences political activity, and an early explanation of social revolution is Marx’s theory of
class structure. Both authors examined particular cases of democracies and revolutions,
and then derived a particular theory from their observations. Modern theories tackle the
same issues, but provide different explanations, using factors such as elections,
state/society interactions, or the precise timing of industrialization in each country under examination.
Advances in social science consist in providing analytically more precise concepts and
typologies for the phenomena under study. Revolutions, for instance, are now studied
under several categories, which distinguish, for instance, “white” (non-violent) revolutions from other ones. By doing so, researchers improve the specification of these phenomena, which we will technically designate as our dependent variables. The deep
anatomy of these social phenomena nonetheless poses a constant challenge to scientists, since before we can start understanding their causes, we need to define and conceptualize complex phenomena such as “civil war”, “counter-insurgency”, “morality”
or “identity”.
The quantitative analysis of social phenomena cannot solve any of these issues, but it
can contribute to improving our knowledge of concept formation, theory building and
comparison across units of observation.

4.2. Theory

Formally, theory building starts with knowledge of prior scientific advances in a given field. A certain amount of knowledge already exists, for example, on why young
mothers abort, or on how durable peace occurs and then persists between nation
states. Everything that you know from your previous courses in the social sciences will
be useful in thinking about your data, especially what you have learned in the fields of
demography, economics, public health or sociology. Once previous knowledge has
been considered, however, the only method of verification that exists for a particular phenomenon is its observation.

Several methods of observation coexist. All of them, and not just quantitative methods, are based on structured comparisons of different units of observation, be they protesters in a public demonstration, voters in an election, young mothers in an abortion clinic, national governments in a technological race, or random members of the public.
Observations are then produced either through experimental or through observational
studies, both of which provide a number of facts, such as a response rate to a question
or the adoption of a particular behaviour. These facts are collected in order to build
theories to explain why and how they occur.
When it is impossible to work on all instances of a phenomenon in the material world,
such as the development of cancer cells or the occurrence of revolutions, scientists focus their attention on carefully selected samples of observations and then generalise
their findings to a larger number of observations. Scientific theories therefore exist for phenomena as diverse, as common and as important as the democratic election of extreme-right parties, or the effect of radiation on the physiological status of human beings.
These operations of theory building guide scientific inquiry. Additional principles concern the rules under which we construct theoretical models. A crucial rule of science consists in suppressing all personal judgement over the data (objectivity), in order to formulate statements that hold generally true rather than serve a given end (normativity). Social science is a branch of inquiry where these principles are particularly difficult to follow, but where they apply nonetheless, and where they make it possible to formulate scientific statements on several aspects of social interaction, from suicide terrorism to divorce, from increases in exports to changes in political leadership.

4.3. Quantitative social science

Quantitative approaches to social science apply the aforementioned scientific rules in order to identify variations in events that involve a number of units such as people, states, elections or civil wars, and that we describe through a certain number of characteristics that vary from one unit to another—variables.
An example of a quantitative result relates to presidential approval: social surveys that measure the extent to which people tend to support their presidents have found that economic performance is often very influential in determining that support. Theoretical models, such as David Easton’s systemic theory of political inputs and outputs, support that kind of finding. Similarly, health expenditure has been measured in Western countries for several decades. Variations in health spending seem easily explained by variations in life expectancy, but also by the increasing costs caused by improvements in medical technology. Current data help to explain that phenomenon: health expenditure growth does not directly depend on the age structure of a country or on the longevity of its residents, but rather on the health status and behaviour of its individuals, which themselves happen to vary with age.

Variables appear in these results and in their explanatory theories. Economic performance, for instance, is a variable often measured through unemployment levels, gross domestic product, public deficits and annual changes in per capita disposable income. Similarly, health expenditure and health behaviour are measured through complex computations of health services supply and demand. The processes and mechanisms that causally connect these variables are established through quantitative, qualitative and also theoretical research, which together lend causal meaning to the correlations that we observe.
There are multiple sources of error in that process. One of the most important comes from the measurement of our variables: a survey question can contain unwanted incentives to answer in a particular way, or it can simply be confusing and misleading, or the answer to a question can be misinterpreted. The careful creation of concepts for complex phenomena such as racism, political identity or illness solves part of that issue. Valid and reliable data are then used to test particular hypotheses, such as the presumption that education and xenophobia are negatively correlated, or that economic growth is proportionate to the openness of national economies to all possible competitors. Quantitative social science verifies, or falsifies, these kinds of hypotheses, based on various sources of data, or statistics.

4.4. Social statistics

Quantitative data come in the form of datasets, which themselves are numeric collections of variables for a given set of units of observation. The example below is taken
from the U.S. National Health Interview Survey (NHIS):

− The rows hold observations: each row of numeric data designates the answers of one individual respondent (the unit of observation).

− The columns hold variables: each column corresponds to a particular question, such as gender, earnings, health status and so on.

In this example, some variables can be ordered: health, for instance, is based on a self-reported measure that ranges from “poor” to “excellent”. Other variables take values that cannot be ordered: raceb, for instance, corresponds to the respondent’s racial-ethnic profile, for which there is no ordering. Other variables have only two possible values, such as sex (either male or female) or insurance status (either covered or not).
These are examples of different types of variables that come in addition to ‘purely numeric’ ones like weight, measured in pounds.
Units of observation are not necessarily individuals: they can be anything from organizations to historical events (Section 5.1). Sometimes, not all variables can be measured
for all observations: there will be missing values. The example below is taken from the
Quality of Government (QOG) dataset:

The Quality of Government dataset uses countries as its unit of analysis. Due to several
difficulties with data collection and measurement, it shows a large number of
missing observations for several variables: the “bl_asyt15” and “bl_asyt25” variables,
for instance, measure the average number of schooling years among the population,
and have a high number of missing values. There are also some missing values in the
column for the iaep_es variable, which holds the legislative electoral system for each
observation.
The combined effect of sampling and missing observations forces us to work on a finite
number of observations, which introduces a further risk of error when we start analysing the data. Furthermore, we will use more than one model, as the type of variables
under examination calls for different statistical procedures. Similarly, the number of variables influences these procedures:
− The distribution of one variable, such as the number of democracies observed at a given point in time or the proportions of each religious group in a given population in 2004, is captured by univariate statistics. These statistics allow us to calculate to what extent our sample might differ from the universe of data that we are sampling from, such as the whole population of a country, all countries or all instances of civil war. This standard error will appear in all our statistical procedures.

− The relationship between two variables, such as racism and income or national wealth and defence spending, is addressed by bivariate tests. These tests provide the probability that a relationship observed within our sample could be caused by mere chance. This crucial statistic is called the p-value: only when it stays under a certain level of significance will we accept that an observed relationship is statistically significant.

− The relationship between two or more variables can also be modelled into an equation, such as productivity = α · technology + β · education. Formally, models include an error term ε in the equation, to account for the sampling error previously mentioned. As with bivariate tests, models also come with a p-value for us to decide whether they can be confidently followed or not.
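In Stata, these three building blocks correspond to commands that later sections of this guide introduce. As a rough sketch, with y, x, x1 and x2 standing for hypothetical variable names, su y produces univariate statistics, corr y x measures a bivariate association, and reg y x1 x2 fits a regression model:

. su y
. corr y x
. reg y x1 x2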
These three types of procedures are the essential building blocks of the course, and of quantitative analysis in general. Because they require thinking about so many different factors at the same time, they also require you to think about data and analysis in a particular manner—statistical reasoning, the primary teaching objective of this course.
More details on the statistical operations covered in the course appear in the course syllabus, which you should read before reading the next section on data. You should also read a few pages of quantitative social science before going further, to make sure that you understand the kind of research that you will be learning to perform, using some introductory procedures.

4.5. Readings

Depending on your experience with quantitative analysis and on your general themes of interest, you should read at least four of the texts below, after making your own selection based on personal interests. If you are not familiar with political science, you will want to include Charles Cameron’s presentation of quantitative analysis in that discipline.
− Do not try to understand in full the methods used by the authors: concentrate on the style of writing and reasoning instead, as well as on the particular form of research question that quantitative researchers examine in different disciplines.

− If you have little experience with either quantitative analysis or with scientific writing, you will need more and not less from that list. Actually, you should read the full list if you have no experience with either, and stop only when you feel familiar enough with the material.

− The reading of these texts is unmonitored, and left entirely up to you to organise. You might want to read at least two texts in the first two weeks, then one more before writing up each assignment, and finally one last before writing the final paper. If you are unsure where to start, begin with the reading by Cameron, then read either one of the Gelman et al. texts or the Bartels one, and then read either Jordan or Tavits.
− Larry M. Bartels, Unequal Democracy: The Political Economy of the New Gilded Age, Princeton University Press, 2008, chapter 5.

In this chapter I explore four important facets of Americans’ views about equality. First, I examine public support for broad egalitarian values, and the social bases and political consequences of that support. Second, I examine public attitudes toward salient economic groups, including rich people, poor people, big
business, and labor unions, among others. As with more abstract support for
egalitarian values, I investigate variation in attitudes toward these groups and
the political implications of that variation. Third, I examine public perceptions of
inequality and opportunity, including perceptions of growing economic inequality, normative assessments of that trend, and explanations for disparities in economic status. Finally, I examine how public perceptions of inequality, its causes
and consequences, and its normative implications are shaped by the interaction
of political information and political ideology.
− Charles Cameron, “What is Political Science?”, in Andrew Gelman and Jeronimo Cortina (eds), A Quantitative Tour of the Social Sciences, Cambridge University Press, 2009, chapter 15.

Politics is part of virtually any social interaction involving cooperation or conflict,
thus including interactions within private organizations (“office politics”) along
with larger political conflicts. Given the potentially huge domain of politics, it’s
perfectly possible to talk about “the politics of X,” where X can be anything
ranging from table manners to animal “societies.” But although all of these are
studied by political scientists to some extent, in the American academy “political
science” generally means the study of a rather circumscribed range of social
phenomena falling within four distinct and professionalized fields: American politics, comparative politics, international relations, and political theory (that is,
political philosophy).

− Ashley M. Fox, “The Social Determinants of HIV Serostatus in Sub-Saharan Africa: An Inverse Relationship Between Poverty and HIV?”, Public Health Reports 125(s4), 2010.

Contrary to theories that poverty acts as an underlying driver of human immunodeficiency virus (HIV) infection in sub-Saharan Africa (SSA), an increasing
body of evidence at the national and individual levels indicates that wealthier
countries, and wealthier individuals within countries, are at heightened risk for
HIV. This article reviews the literature on what has increasingly become known
as the positive-wealth gradient in HIV infection in SSA, or the counterintuitive
finding that the poor do not have higher rates of HIV. This article also discusses
the programmatic and theoretical implications of the positive HIV-wealth gradient for traditional behavioral interventions and the social determinants of health
literature, and concludes by proposing that economic and social policies be leveraged as structural interventions to prevent HIV in SSA.

− Andrew Gelman et al., Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do, Princeton University Press, 2008, chapter 2.

This book [chapter] was ultimately motivated by frustration at media images of
rich, yuppie Democrats and lower-income, middle-American Republicans—
archetypes that ring true, at some level, but are contradicted in the aggregate.
Journalists are, we can assume, more informed than typical voters. When the
news media repeatedly make a specific mistake, it is worth looking at. The perception of polarization is itself a part of polarization, and views about whom the
candidates represent can affect how political decisions are reported. And, as we
explore exhaustively, the red–blue culture war does seem to appear in voting
patterns, but at the high end of income, not the low, with educated professionals moving toward the Democrats and managers and business owners moving
toward the Republicans.
− David Karol and Edward Miguel, “The Electoral Cost of War: Iraq Casualties and the 2004 U.S. Presidential Election”, Journal of Politics 69(3), 2007.

Many contend that President Bush’s reelection and increased vote share in 2004
prove that the Iraq War was either electorally irrelevant or aided him. We present contrary evidence. Focusing on the change in Bush’s 2004 showing compared to 2000, we discover that Iraq casualties from a state significantly depressed the President’s vote share there. We infer that were it not for the approximately 10,000 U.S. dead and wounded by Election Day, Bush would have
won nearly 2% more of the national popular vote, carrying several additional
states and winning decisively. Such a result would have been close to forecasts
based on models that did not include war impacts. Casualty effects are largest in
“blue” states. In contrast, National Guard/Reservist call-ups had no impact beyond the main casualty effect. We discuss implications for both the election
modeling enterprise and the debate over the “casualty sensitivity” of the U.S.
public.

− Rachel Margolis and Mikko Myrskylä, “A Global Perspective on Happiness and Fertility”, Population and Development Review 37(1), 2011.

The literature on fertility and happiness has neglected comparative analysis. We investigate the fertility/happiness association using data from the World Values Surveys for 86 countries. We find that, globally, happiness decreases with the number of children. This association, however, is strongly modified by individual and contextual factors. Most importantly, we find that the association between happiness and fertility evolves from negative to neutral to positive above age 40, and is strongest among those who are likely to benefit most from upward intergenerational transfers. In addition, analyses by welfare regime show that the negative fertility/happiness association for younger adults is weakest in countries with high public support for families, and the positive association above age 40 is strongest in countries where old-age support depends mostly on the family. Overall these results suggest that children are a long-term investment in well-being, and highlight the importance of the life-cycle stage and contextual factors in explaining the happiness/fertility association.
− Patrick Sturgis and Patten Smith, “Fictitious Issues Revisited: Political Interest, Knowledge and the Generation of Nonattitudes”, Political Studies 58(1), 2010.

It has long been suspected that, when asked to provide opinions on matters of
public policy, significant numbers of those surveyed do so with only the vaguest
understanding of the issues in question. In this article, we present the results of
a study which demonstrates that a significant minority of the British public are,
in fact, willing to provide evaluations of non-existent policy issues. In contrast to
previous American research, which has found such responses to be most prevalent among the less educated, we find that the tendency to provide ‘pseudo-opinions’ is positively correlated with self-reported interest in politics. This effect
is itself moderated by the context in which the political interest item is administered; when this question precedes the fictitious issue item, its effect is greater
than when this order is reversed. Political knowledge, on the other hand, is associated with a lower probability of providing pseudo-opinions, though this effect is weaker than that observed for political interest. Our results support the
view that responses to fictitious issue items are not generated at random, via
some ‘mental coin flip’. Instead, respondents actively seek out what they consider to be the likely meaning of the question and then respond in their own
terms, through the filter of partisan loyalties and current political discourses.


Part 1. Data
Quantitative data is a particular form of data that simplifies information into variables
that take different types and values. Some variables, such as gross domestic product or
monthly income, hold continuous data that are strictly numeric, while others hold categorical data, such as social class for individuals or political regime type for nation states.
The collection of data for quantitative analysis systematically creates issues of measurement and reliability that also apply to qualitative research. These issues are usually
explored by classes that focus on social surveys and research design. In this course, we
assume that you already know of some of the issues that apply to data collection.
Manipulating a dataset is a complex task that requires some familiarity with the structure of the data and with the software commands available to prepare it. Working with pre-assembled datasets, as we will in this course, will simplify these operations a great deal, but will not entirely suppress them.
This section describes the essential steps that you should follow to prepare your data before starting your analysis. The four sections are best read as one block of the guide, as they frequently overlap. If you are using a pre-assembled Stata dataset that comes in cross-sectional format, skip Sections 7.1 to 7.3.


5. Structure
In quantitative environments, information is stored in datasets that hold observations and variables. Understanding the structure of your data is an absolute requirement for its analysis, for the following reasons:
− The studies that motivate data collection have different goals. This course will cover only observational studies using cross-sectional data (Section 5.1).

− The observations contained in a dataset generally consist of a sample taken from a larger population. The representativeness of your data depends on how that sample was initially constructed (Section 5.2).

− The variables of a dataset consist of numerical, text or missing values assigned to each observation, following a consistent level of measurement (Section 5.3).

This section briefly reviews each of these aspects.

5.1. Studies

All quantitative studies use samples, variables and values, but distinctions apply among
them given the wider research strategy for which the data were collected. The principal
issue at stake is the type of randomization employed in the study:
− Experimental studies designate research designs where the observer is able to interact with the subjects or patients that compose the sample. Experimental settings are common in psychology and clinical studies, where subjects or patients are often randomly assigned to a ‘treatment’ and a ‘control’ group to study the effects of a particular drug or setting on them. These studies generally rely on small samples and on an analysis of variance (ANOVA).

− Observational studies designate research designs where the observer is not able to interact with the sample. The randomized component does not have to do with assigning treatments but with randomly sampling observations, which are most often individuals from a given population. Such studies are extremely common in research that focuses on social and political ‘treatments’ (such as environmentalism or drug addiction) that cannot be assigned to subjects.

This course explores non-experimental data collected in observational studies. A further
distinction applies between these studies, depending on the period of observation for
which the data were collected:
− Cross-sectional studies are collected at one particular point in time and provide 'snapshots' of data in a given period, such as political attitudes in the American population a few days after September 2001, or health expenditure levels in EU member states in 2010–2011.

− Time series are collected at repeated points in time. In the case of cross-sectional time series (CSTS), a different sample is collected at each point. If the same sample is used throughout, the study provides longitudinal information on a given 'panel' or 'cohort', such as U.S. households or OECD countries.

This course will focus on cross-sectional data, which are the most readily available, given the high cost of collecting longitudinal data. Many common forms of surveys, such as opinion polls, are cross-sectional, although larger research surveys often include a panel component that involves following a group of individuals over several years.
Cross-sectional data have their own statistical limits. Although they allow comparisons across observations, they provide no information on the changes that occur over time within and between the units of analysis. That information appears in longitudinal data, which require additional statistical methods outside of the scope of this course.

5.2. Sampling

An observation is one single instance of the unit of analysis. The unit of analysis is a
unique entity for which the data were collected, and can be virtually anything as long
as a clear definition exists for it. Voters, countries or companies are common units of
analysis, but events like natural disasters and civil wars are also potential candidates.
The definition of the unit of analysis sets the population from which to sample.
For instance, if you are studying voting behaviour in France, your study is likely to apply
only to the French adult population that was allowed to vote at the time you conducted
your research. Dataset codebooks usually discuss these issues at length.
The sample design then sets how observations were collected. The various techniques that apply to sampling form a crucial component of quantitative methods, which can be broken down into a few essential elements that you will need to understand in order to assess the representativeness of your data:
− The sample size designates the number of observations, noted N, contained in the dataset. Since variables often have missing values, large segments of your analysis might run on a lower number of observations than N (Section 6.3).

Sample size affects statistical significance through sampling error, which characterises the difference between a sample parameter, such as the average level of support for Barack Obama in a sample of N respondents, and a population parameter, such as the actual average level of support for Barack Obama in the full population of U.S. voters (the size of which we might know, or not).

The Central Limit Theorem (CLT) shows that repeated sample means are normally distributed around the population mean.

Sampling error is calculated using the standard error, from which confidence intervals for parameters such as the sample mean are derived. The standard error decreases with the square root of the sample size, while the width of a confidence interval grows with the level of confidence of the estimate. Consequently, the law of large numbers applies: larger samples approximate population parameters better and are preferable to obtain robust findings.
− The sampling strategy designates the method used to collect the units contained within the sample (i.e. the dataset) from a larger universe of units, which can be a reference population such as adult residents in the United States or all nation-states worldwide at a given point in time.

Sampling strategy has an impact on representativeness. Surveys often try to achieve simple random sampling, using a method of data collection designed to assign each member of the population an equal probability of being selected, so that the results of the survey can be generalised to the population.

Random or systematic sampling from particular strata or clusters of the population are among the methods used by researchers to approximate that type of representativeness. These methods can only approximate the whole universe of cases, as when a study ends up containing a higher proportion of old women than the true population actually does, which is why observations in a sample are often weighted in order to better match the sample with its population of reference.

Other methods of data collection rely on nonprobability sampling. When the unit of analysis exists in a small universe, such as states or stock market companies, or when the study is aimed at a particular population, such as Internet users or voters, the sampling strategy targets specific units of analysis, with results that are not necessarily generalizable outside of the sample.

The sampling strategy can correct for design effects such as clustering, systematic noncoverage and selection bias, all of which negatively affect the representativeness of the sample. Representativeness can be obtained through careful research design and weighted sampling. Stata handles complex survey design through several weight options passed to the svyset and svy: commands, both covered at length in the Stata documentation.
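To make the role of the standard error concrete, it can be computed by hand from the results that the su command stores in memory. This is only a sketch, ignoring weights and design effects, and var is a placeholder for any continuous variable in your dataset:

. su var
. di "standard error of the mean = " r(sd) / sqrt(r(N))

Dividing the standard deviation r(sd) by the square root of the number of observations r(N) shows directly why quadrupling the sample size halves the standard error.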

Important: Neither sample size nor sampling strategy will remove measurement errors that occur at earlier or later stages of data collection. Representativeness is only one aspect of survey design. On the one hand, it is technically possible to collect representative answers to very poorly written survey questions that will ultimately measure nothing. Ambiguously worded questions, for instance, will trigger unreliable answers that will cloud the results regardless of the statistical power and representativeness carried by the sample size and strategy. There is no statistical solution to the bias induced by question wording and order. On the other hand, coding and measurement errors can reduce the quality of the data in any sample: again, representativeness does not control for such issues. The “garbage in, garbage out” principle applies: poorly designed studies will always yield poor results, if any.
Application 5a. Weighting a cross-national survey

Data: ESS 2008

The European Social Survey (ESS) contains a design weight variable (dweight) to account for the fact that some categories of the population are over-represented in its
sample. The table below was obtained by selecting a few observations from the study,
using the sample command with the count option to draw a random subsample of 10
observations; the list command was then used to display the country of residence,
gender and age of each respondent in this subsample, along with the design and population weights.
. sample 10, count
(51132 observations deleted)

. list cntry gndr agea dweight pweight

       cntry   gndr   agea   dweight   pweight
  1.      UA      M     18    1.3027      2.15
  2.      CH      F     51    1.0826      0.35
  3.      CH      M     52    1.0826      0.35
  4.      NL      M     24    1.0099      0.76
  5.      RU      F     25    0.4878      4.82
  6.      RU      F     19    2.1556      4.82
  7.      CY      F     47    0.7652      0.05
  8.      GB      M     54    1.0141      2.14
  9.      CH      M     31    1.0835      0.35
 10.      RU      M     41    1.1936      4.82
The ESS documentation describes dweight (design weight) as follows:
Several of the sample designs used by countries participating in the ESS were
not able to give all individuals in the population aged 15+ precisely the same
chance of selection. Thus, for instance, the unweighted samples in some countries over- or under-represent people in certain types of address or household,
such as those in larger households. The design weight corrects for these slightly
different probabilities of selection, thereby making the sample more representative of a ‘true’ sample of individuals aged 15+ in each country.
By looking at the subsample listed above, you can spot two observations for which the dweight variable is below 1: both are females who were drawn from households that are over-represented by the sampling strategy used by the ESS, and are therefore assigned design weights under 1. Conversely, the other Russian female, aged 19, was drawn from a very under-represented household, and her assigned design weight is therefore above 2. These weights, when used with the [weight] operator or the svyset command, ensure that these observations are given more or less importance when computing frequencies and other aspects of the data, so as to compensate for their under- or over-representation in the ESS sample in comparison to the actual population from which they were drawn.
The ESS documentation describes pweight (population weight) as follows:
This weight corrects for the fact that most countries taking part in the ESS have
very similar sample sizes, no matter how large or small their population. Without weighting, any figures combining two or more country’s data would be incorrect, over-representing smaller countries at the expense of larger ones. So
the Population size weight makes an adjustment to ensure that each country is
represented in proportion to its population size.
By looking again at the data above, you can indeed observe that the two respondents
from Ukraine and Britain, two countries with large populations, have population
weights around 2, and that the three respondents from Russia have an even higher
population weight, whereas small countries like Cyprus or Switzerland have much
smaller values. This makes sure that, when calculating the frequencies of a variable over
several countries (such as the percentage of right-wing voters in Europe), the actual
population size of each country is taken into account.
In conclusion, to weight the data at the European level, we need to account for both design and population weights. Design weights correct for the over- and under-representation of some socio-demographic groups, and population weights make sure that each national population accounts for its fraction of the overall European population.
This is done with the svyset command, by multiplying both weights together and declaring the product as a probability weight:
. gen wgt=dweight*pweight
. svyset [pw=wgt]
pweight: wgt
VCE: linearized
Single unit: missing
Strata 1:
SU 1:
FPC 1:

While population and design weights are pretty straightforward in this example, surveys can reach high levels of complexity when researchers try to capture multistage contexts by sampling from several strata and clusters of the target population. For example, large demographic surveys will often sample cities, then neighbourhoods within them, then households within them, and finally adults within those households.
Application 5b. Weighting a multistage probability sample

Data: NHIS 2009

The National Health Interview Survey (NHIS) is a good example of “a complex, multistage probability sample that incorporates stratification, clustering, and oversampling of
some subpopulations” for some of its available years of data. It would take too much
space to document the study fully, but the most basic weight, perweight, provides a
good example of how weights are constructed:
This weight should be used for analyses at the person level, for variables in
which information was collected on all persons. [The weight] represents the inverse probability of selection into the sample, adjusted for non-response with
post-stratification adjustments for age, race/ethnicity, and sex using the [U.S.]
Census Bureau's population control totals. For each year, the sum of these
weights is equal to that year’s civilian, non-institutionalized U.S. population.
More documentation from the NHIS then introduces strata, which “represents the impact of the sample design stratification on the estimates of variance and standard errors,” and psu, which “represents the impact of the sample design clustering on the estimates of variance and standard errors.” Both parameters would require a course on survey design to be fully explained. In the meantime, they can be passed to the svyset command along with perweight, the sampling weight:
. svyset psu [pw=perweight], strata(strata)
pweight: perweight
VCE: linearized
Single unit: missing
Strata 1: strata
SU 1: psu
FPC 1:

5.3. Variables

A variable is any measurement that can be described using more than one numeric value; its values should hence vary across observations.
Each variable is defined by a range of possible values. At the most basic level, some variables are considered quantitative because their values can be ordered meaningfully into
levels, and some variables are considered qualitative because there is no substantively
significant ordering of their values. This distinction is rather imprecise and relatively misleading, which is why we will use a more advanced classification of variables below.
The level of measurement used by each variable in your dataset is the very first thing
that you need to understand about the data before analysing it:
36

!
− A nominal scale qualifies a variable that was measured using discrete categories that cannot be objectively ordered. Examples of nominal variables are religious beliefs and legal systems: there is no objective ordering of “Jewish” and “Muslim”, and “English Common Law” is a category distinct from “French Commercial Code”.

A specific nominal scale uses dichotomous categories, which result in binary variables that can take only 0 or 1 as values. Examples of binary variables are sex and democracy: an individual person is either female (1) or not (0), and a political regime is either democratic (1) or not (0). The discrete values denote a nominal difference, not an objective order.

− An ordinal scale qualifies a variable that was measured using categories that can be ordered regardless of their distance. Examples of ordinal variables are educational attainment and internal conflict: ‘primary school’, ‘secondary school’ and ‘university degree’ can be ordered, just like ‘low’, ‘medium’ and ‘high’ internal conflict.

Ordinal scales do not reflect meaningful distances between their categories. For example, the difference in educational attainment between ‘primary school’ and ‘secondary school’ is not equal to the difference between ‘secondary school’ and ‘university degree’. The interval between the categories is variable.

− An interval scale qualifies a variable that was measured using categories that can be ordered at equal distance. Examples of interval variables are age groups and approximate indexes: the same distance exists between the “15–19”, “20–24” and “25–29” age groups, as well as between each category of Transparency International’s Corruption Perceptions Index, which ranges from 0 (highly corrupt) to 10 (highly clean).

Interval variables do not have an absolute zero, insofar as the first level of the interval is relative and does not designate a meaningful zero point. For example, being in the lowest age group does not literally signify being 0 years old, just as being in the highest category of the Corruption Perceptions Index does not indicate that the level of corruption is 0%.

− A ratio scale qualifies a variable that was measured against a numeric scale with an absolute zero point. Examples of ratio variables are income and inflation: either of them can take 0 as a substantive value, indicating the absence of income and a 0% inflation rate respectively. Variables of this type might be thought of as ‘purely’ continuous.

The level of measurement is a decisive aspect of statistical modelling, as it determines how to describe, analyse and interpret each variable. The type of the dependent variable is particularly important, and a simpler classification applies:

37

!
− Continuous variables hold values for which we can calculate counts or ratios. Examples of continuous variables are number of children and economic growth: an individual can have any number of children, and a state can experience any percentage of economic growth. In both cases, we can meaningfully compare the values across observations.

Some distinctions apply within continuous data: count data hold positive integer values, as applies to the number of children, since it is impossible to have ‘7.5 children’ or ‘–3 children’ but only { 0, 1, 2, … n } children. Continuous variables like economic growth can take virtually any value, from –∞ to +∞, even though they empirically exist in a more restricted range.

− Categorical variables hold values for which we can only observe discrete categories. In statistical modelling, however, any categorical variable with an ordinal or interval scale can be treated as pseudo-continuous, and the categorical classification then applies only to nominal variables. This course will often refer to continuous data in this looser sense, to include ordinal and interval variables.

The dependent variable in your research project should ideally be ‘purely’ continuous (ratio), but ordinal, interval and count variables are also possible candidates, since linear regression will also function for these types of data. More advanced models exist to better handle categorical data, but they are beyond the scope of this course.

Further instructions apply to variable manipulation, as you will often be required to
modify the variables in your dataset. These are described in Section 8, but the “garbage
in, garbage out” principle still applies: poorly designed studies cannot be rescued by
good data manipulation.
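As a practical rule of thumb in Stata, the level of measurement also tells you which commands to reach for when first describing a variable; var is a placeholder for any variable in your dataset:

. tab var          // frequencies: suited to nominal and ordinal variables
. su var, detail   // mean and percentiles: meaningful for (pseudo-)continuous variables

If the tabulation produces a short list of labelled categories, treat the variable as categorical; if it produces dozens of distinct numeric values, the summary statistics are the more informative description.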

5.4. Values

The “number-crunching” aspect of quantitative research methods is due to the fact
that all the information considered by the analyst will come in the form of numbers rather than text (called “strings” in computer environments). Numeric values in a dataset
can point to three different kinds of data:
− Continuous data are stored in numeric values, using integer and float formats. The unit of measurement for the values, such as years for a variable describing age or percentage points for a variable describing gross domestic product, is not stored with the values but is often indicated in the variable label.

When necessary, as with cross-tabulations (Section 10), continuous data can easily be turned into categorical data using the recode command. Variables such as age or income, for example, can be better crosstabulated in the form of age or income groups. When performing linear regression (Section 11), however, continuous data are more appropriate.
− Categorical data are also stored in numeric values, to which labels are assigned. For example, a possible variable for gender will take the value 1 for males and 2 for females, although a better encoding would consist in using a binary variable called female that would code 0 for males and 1 for females.

The recode command allows creating new variables by assigning new value labels to the data, based on existing ones. For example, if your data contain a variable measured on an ordinal scale of ten categories from 1 ‘Strongly agree’ to 10 ‘Strongly disagree’, this scale can be recoded into a simpler scale of five categories, or even into a binary variable.

− Missing data are observations for which the variable takes no value, due to an issue that arose at the level of data collection, such as respondents being unable (or refusing) to answer, or insufficient information to measure the value. Stata identifies missing data with the “.” character.

When missing data are not coded as “.” but as numeric values, such as “-1” or “999” for variables that cannot take these values, you will have to use the replace command in Stata to change this coding to “.”, as illustrated in Section 8.2.

Missing values will actually require a lot of attention at all points of your analysis, in order to avoid all sorts of calculation and interpretation errors when looking at frequencies and crosstabulations. Furthermore, missing values will constrain the number of observations available for correlation and regression analysis.
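Both operations can be sketched as follows, using hypothetical variable names: the first command collapses a ten-point agreement scale into a binary variable, and the second replaces a numeric missing-value code with Stata’s “.” marker:

. recode agree10 (1/5 = 1 "Agree") (6/10 = 0 "Disagree"), gen(agree2)
. replace income = . if income == 999

The gen() option keeps the original variable intact, which is good practice: you can always check your recoding against the source data.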

As suggested by this description of values, a very important part of quantitative analysis
consists in learning about the exact coding of the data in order to better manipulate
variables later on. The practical aspect of that task (modifying values and labels) is covered in more detail in Section 8.
The substantive aspect of that task relies entirely on the analyst. Reading from the dataset codebook is essential to understand how the values of each variable were obtained. For example, issues of measurement and reliability will inevitably exist with aggregate indices, self-reported data and psychometric scales.

Data: ESS 2008

The third-party fre command displays frequencies for a given variable in a better way
than the built-in tab command does. After installing the command, we look at attitudes
towards immigration from outside Europe in a sample of European respondents:

. fre impcntr [aw=dweight*pweight]

impcntr — Allow many/few immigrants from poorer countries outside Europe
-------------------------------------------------------------------------
                                      Freq.   Percent    Valid      Cum.
Valid    1  Allow many to come
            and live here          6090.343     11.91    12.68     12.68
         2  Allow some             16460.09     32.19    34.26     46.94
         3  Allow a few            15799.46     30.89    32.89     79.83
         4  Allow none             9688.713     18.94    20.17    100.00
         Total                     48038.61     93.93   100.00
Missing  .a                        140.6597      0.28
         .b                        2938.206      5.75
         .c                        24.52225      0.05
         Total                     3103.388      6.07
Total                                 51142    100.00

− The observations are weighted, which explains why the frequencies are not integers but still sum up to N = 51142.

We used the Stata [aw] (analytic weight) qualifier, which allows the use of weights like the design and population weights in the ESS (Application 5a). Since we are looking at the whole sample of European respondents, we used both weights, as recommended in the ESS documentation.

− The variable under examination is an ordinal one, with four categories that can be ordered by their degree of tolerance towards immigrants.

If we wanted to isolate the last categories, ‘Allow few/none’, in order to focus on respondents who are most resistant to immigration, we could recode the variable as a binary one, using a dichotomous separation between ‘Allow some/many’ and ‘Allow few/none’.

− There are missing values, coded as .a, .b and .c. These are variations of the “.” Stata format for missing data (examined again in Section 8).

The difference in letters is used to code for different types of missing data: as the ESS documentation explains, .a codes for respondents who refused to answer, .b codes for respondents who did not know what to answer (often called ‘DK’ or ‘DNK’), and .c codes for respondents who did not answer (‘NA’).

Detailed missing values can hint at why there is missing data in the first place: here, most of the 6% missing data come from respondents who declared not having, or not being able to form, an opinion on the topic, rather than from respondents refusing to answer, which is common when the question touches upon a topic affected by desirability bias (i.e. when some answers are more positively or negatively connoted than others).
*
A serious issue in scientific inquiry is overreliance on a limited number of data sources
and methods of study. If you are willing to spend some time exploring around, then
quantitative analysis will expand your abilities on both counts. Your skills with data and
methods are not just part of your academic curriculum: if you care enough to maintain
your level of knowledge in that area, they will stay with you all your life. My experience
with these skills shows that they have both personal and professional value.
The first step in acquiring those skills consists in training yourself to work with quantitative data. As with most activities, there is no substitute for training: your familiarity with quantitative data and methods primarily reflects the time you spend on them. With a few key terms of interest in mind, you should therefore start exploring data as soon as possible.
Prior to opening the datasets, take a look at their documentation files. Do not aim at
understanding every single aspect of the documentation: focus on survey design and
sampling, which should be fully documented in the data codebook. The next sections
will then guide you through data exploration and management: Section 6 explains how
to explore a dataset, Section 7 explains how to prepare it for analysis, and Section 8
covers further data management operations with variables.


6. Exploration
Exploring quantitative data requires either assembling your own data (an option not
covered in this course) or locating some pre-assembled datasets online. The diffusion of
quantitative data has made tremendous progress in the past decade, and an amazing
range of – often underused – datasets are available online.
The socio-political and technological determinants of the current ‘data deluge’, as an
article in The Economist once put it, are outside of the scope of this guide, but a lot of
online commentary and analysis exists on the ‘data revolution’ in science, journalism
and government (check Victoria Stodden’s work first).

6.1. Access

The course makes extensive use of a few recommended datasets that were selected based on several criteria, ranging from topical interest to simplicity and quality. For your own project, you will first be offered to work with recent versions of the European Social Survey (ESS) and Quality of Government (QOG) data.
If you plan to go beyond these recommended sources, turn to the course material to
learn more on data repositories and data libraries. High-quality data is still rare, and
good sources to look for such data are the ICPSR and CESSDA repositories, listed on
the course website.
Important aspects of data retrieval include the following:
− Always download the documentation for your data. Professional-quality datasets come with extensive codebooks that help with understanding the data structure, as well as with other notes on the data itself.

− Never rely on any source to preserve the data for you. Even if the integrity of data repositories is improving, always keep a pristine (intact) copy of the (original) datasets that you use in your personal archives.

− Full acknowledgment of the source is an ethical counterpart. In order to make a legitimate use of datasets for either research or teaching purposes, reference the source in full and follow all related instructions.

6.2. Browsing

The simplest way to quickly explore your data is to open the Data Editor after you have loaded your dataset: type browse (or edit if you plan to modify the data) in the Command window, and the Data Editor window will open. Alternatively, use the Ctrl-8 (Windows) or Cmd-8 (Macintosh) keyboard shortcut.
The variables contained in Stata datasets can be explored with the codebook command.
That same command will also return information about the dataset itself, as will the
notes command if the dataset comes with Stata data notes. It is not, however, very
practical to explore data with these tools.
− Always start with the describe command, which can be abbreviated to its single-letter shorthand, ‘d’. Simply typing d into Stata will return the list of variables, and typing d followed by a list of one or many variable names will describe only these. High-quality datasets should always come with intelligent variable names and labels.

− When the list of variables is too long to be inspected in full, use the lookfor command to search for keywords in the names and labels of variables. The keywords need not be complete terms: ‘immig’ will work fine, for instance. You can use any number of keywords with lookfor: be aware of synonyms and try different possibilities.

− Finally, learn how to use the rename command (shorthand: ren) as soon as possible. When you identify a variable of interest, there is a fair probability that its name will be some kind of strange acronym or something even less comprehensible, like ‘v241’ or ‘s009’. Renaming the variable will help solve that issue.

In some situations, you might also want to use the sort and order commands, respectively to sort the observations according to the values of one particular variable, or to
reorder the variables in the dataset. Turn to the Stata help pages for the full documentation of these commands.
Application 6a. Locating and renaming variables

Data: ESS 2008

‘Microdata’ is a term that generally refers to data based on individual respondents. At
that level of analysis, common demographic and socioeconomic variables include age,
gender, income and education.
We used the lookfor command to identify variables that refer to income and earnings,
which we did not type in full to allow for all ‘earn-‘ terms to show up in the results:

. lookfor income earn

              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------------------
gincdif         byte   %1.0f       gincdif    Government should reduce
                                                differences in income levels
hincsrc         byte   %2.0f       hincsrc    Main source of household income
hincsrca        byte   %2.0f       hincsrca   Main source of household income
hinctnta        byte   %2.0f       hinctnta   Household's total net income,
                                                all sources
hincfel         byte   %1.0f       hincfel    Feeling about household's
                                                income nowadays

Once the variable of interest has been identified, we use the rename (shorthand ren)
command to give it a more explicit name. When successfully run, the command does
not send back any output:
. ren hinctnta income

After doing the same for other variables and writing all ren commands to our do-file,
we obtain a list of variables that can be described as follows:
. d age gender edu income

              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------------------
age             int    %3.0f       agea       Age of respondent, calculated
gender          byte   %1.0f       gndr       Gender
edu             byte   %2.0f       eduyrs     Years of full-time education
                                                completed
income          byte   %2.0f       hinctnta   Household's total net income,
                                                all sources

6.3. Selections

At several points of your analysis, you will want to apply commands to selected parts of
your data, as in the case where you might want to summarize a variable only for a selected category of subjects in a survey. In that case, you will be using the if conditional
statement.
The if statement works by adding a specific condition written with mathematical signs
to indicate equality or inequality, summarised below:
==      equal to                    >       greater than
!=      not equal to                <       less than
>=      greater than or equal to    <=      less than or equal to
mi()    missing                     !mi()   not missing

Conditions can be combined by using two logical operators:

&    and
|    or

The use of conditions is fairly intuitive, except for more elaborate patterns that use parentheses to build sophisticated conditions, which we should not need for this course. It is important to be able to use conditions, since they apply to almost all operations that we will use for analysis.
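As a sketch of combined conditions, using the renamed ESS variables from Application 6a and assuming that gender is coded 2 for females:

. su edu if gender == 2 & age < 30    // schooling years of females under 30
. count if mi(income) | mi(edu)       // respondents missing either variable

The first line requires both conditions to hold at once, while the second counts an observation as soon as either condition holds.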
Application 6b. Counting observations

Data: ESS 2008

The count command simply counts observations in a dataset, based on a given condition. Without any condition, it just counts all observations:
. count
51142

If we are interested in knowing how many observations the dataset includes for respondents strictly over 64 years old, we type the following, which uses the renamed variables from Application 6a:
. count if age > 64
10949

A slight issue here is that Stata treats missing values encoded as “.” as positive infinity, which means that the above command included the missing values of the age variable in its count of observations over 64.
The following command recounts respondents strictly over 64 years old without these,
by combining two conditions with the conjunctive & (“and”) operator:
. count if age > 64 & !mi(age)
10803

The last command literally translates as: count observations for which the age variable takes a value strictly over 64 and is non-missing. Missing values often appear in forms other than ".", which is another issue that we will learn to solve in Section 8.4.
Application 6c. Selecting observations    Data: ESS 2008

Turning to gender and education, we want to look at the average schooling years of young females. We start with the list command to show a fraction of the data, and then compute summary statistics with the su command to learn the basics about the distribution of the edu variable:
. list age gender edu in 1/10, clean

       age   gender   edu
  1.    36        M    18
  2.    26        F    15
  3.    69        M    18
  4.    77        F    15
  5.    27        M    13
  6.    32        F    12
  7.    19        F    13
  8.    28        F    17
  9.    49        M    16
 10.    57        M    16

. su edu

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         edu |     50682    11.96253    4.225673          0         50

The list … in … command is purely exploratory: it lets you glance at a few lines of data in the same way as browse or edit would from the Stata Data Browser/Editor window. The clean option is just a cosmetic fix.
If we are interested in the average schooling years of male and female adult respondents below 65 years old, we form a conjunctive conditional statement for age and use the bysort command to separate respondents by gender:
. bysort gender: su edu if age >= 18 & age < 65

------------------------------------------------------------------------
-> gender = M

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         edu |     17890    12.73181    3.836265          0         48

------------------------------------------------------------------------
-> gender = F

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         edu |     20461    12.62538    4.011698          0         40

------------------------------------------------------------------------
-> gender = .a

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         edu |         4       12.25         1.5         11         14

Note that we did not use the & !mi(age) conditional statement, because Stata will not include missing values in the [18; 65) interval, as it would in the (64; +∞) interval formed in the previous command.
The use of conditionals is very common at all stages of analysis. The ‘or’ logical statement also becomes useful when selecting observations based on the values taken by a
categorical variable, which can be explored with the fre command:
. fre cntry, rows(9)

cntry    Country
----------------------------------------------------------
                Freq.    Percent     Valid      Cum.
----------------------------------------------------------
Valid   BE       1760       3.44      3.44      3.44
        BG       2230       4.36      4.36      7.80
        CH       1819       3.56      3.56     11.36
        CY       1215       2.38      2.38     13.73
        :           :          :         :         :
        SI       1286       2.51      2.51     88.13
        SK       1810       3.54      3.54     91.67
        TR       2416       4.72      4.72     96.39
        UA       1845       3.61      3.61    100.00
        Total   51142     100.00    100.00
----------------------------------------------------------

In the command above, we tabulated the countries (cntry) of residence of the ESS respondents. The variable cntry is a nominal variable encoded as text: no numeric value exists for the labels "BE" (Belgium, for which the dataset holds a total of N = 1760 respondents), "BG" (Bulgaria, N = 2230), … "TR" (Turkey, N = 2416) and "UA" (Ukraine, N = 1845). This means that we will need to use strings (text) in "double quotes" to refer to these values in Stata commands, as in this count of Greek respondents:
. count if cntry=="GR"
2072

If our analysis were to focus on the average level of male and female education in Greece and Cyprus, we would run a command using the disjunctive | ("or") operator to include respondents from both countries:

. bysort gender: su edu if cntry=="GR" | cntry=="CY"

------------------------------------------------------------------------
-> gender = M

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         edu |      1544    11.83679    3.988876          0         24

------------------------------------------------------------------------
-> gender = F

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         edu |      1723    11.37725    3.925022          0         24

------------------------------------------------------------------------
-> gender = .a

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         edu |         0

Of course, if the whole analysis is focused on Greece and Cyprus, we would first think of subsetting the data to these countries only. This matter is covered in Section 7, along with other instructions about dataset manipulation. Issues to do with variable coding, such as encoding string variables or manipulating labels, are covered in Section 8. Finally, the su and fre commands are explained again in detail when covering distributions and frequencies in Section 9.


7. Datasets
Stata datasets are characterised both by the DTA dataset file format and by the way the data are arranged within the file:

−  File format. Your dataset should come as a single file in Stata .dta format. If your data come in any other format, you will have to convert them (Section 7.1). If your data come in more than one file, you will need to merge all components into one file (Section 7.2).
   Many issues can appear during dataset conversion, such as text conversion errors with accents or other characters, or mismatches in merged data; these issues will require manual fixing or advanced editing techniques that are beyond the scope of this course.

−  Data format. The rows of your dataset should hold your units of observation (most commonly individuals or states) and its columns should hold your variables (such as sex or country name). To quickly check that structure, open the Data Editor by typing browse.
   If your data are formatted as time series, with variables in rows and values for each time unit (such as years) in columns, you will need to use the reshape command (Section 7.3). This often happens with country-level data measured over several years.

Finally, for the purposes of this course, you are required to work on cross-sectional data that were collected at only one point in time. If your dataset contains time series or any form of longitudinal study, you will need to subset it to a single time period (Section 7.4).

Important: data management is time-consuming, error-prone and complex. All the operations described in this section, with the exception of subsetting, will draw a lot of energy from you. If your dataset for this course cannot be made ready quickly (i.e. within around two full days of work), do not engage in longer operations that might eventually fail and leave you without usable data.
If you get stuck, start by checking the UCLA Stat Computing advice page for guidance:
http://www.ats.ucla.edu/stat/stata/topics/data_management.htm.

7.1. Conversion

Most simply, some datasets come in compressed archives like ZIP files, which you will
need to decompress while making sure that no error occurred during decompression.
Free decompression software exists for all operating systems.

Do not try to use raw ASCII data, for which you would need the infix and dictionary commands; these are too time-consuming for the purposes of this course. Ask us for help if you really need to bypass this recommendation.
If your files come in SPSS or SAS format, or in any other format used by another statistical package, you will need a conversion utility to convert the data. We should be able to use Stat/Transfer to help with that process.
Check the Stata FAQ from UCLA Stat Computing for guidance on dataset conversion:
http://www.ats.ucla.edu/stat/stata/faq/default.htm. Again, several encoding issues can
occur during dataset conversion, and you will be required to perform a thorough check
of the result to clear any possible mistake.
If your data come in a format supported by Microsoft Excel, you should export your
data to CSV format and import it into Stata with the insheet command; see
http://dss.princeton.edu/online_help/stats_packages/stata/excel2stata.htm.
Briefly put, your data should all fit on a single Excel spreadsheet and contain nothing other than the data, except for the header row on the first line of your file. The header must contain the variable names, which should be short and must not start with an underscore (_) or a number.
Your numeric data must not contain formulae and must be formatted as plain numbers—do not use any other format, as it might cause issues when importing into Stata.
Furthermore, the numeric values should not use commas (,) as they can disrupt the CSV
format.
Finally, make sure that the missing observations are represented by blank cells in your
data. To do this, you must find and replace all characters that are often used to mark
missing observations (such as “NA”) with either blank space or the standard “.” symbol
for missing values in Stata.
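Once the spreadsheet has been cleaned along these lines, importing the exported CSV file takes a single command. This is a minimal sketch and the filename is hypothetical:

```stata
* Import a CSV file exported from Excel; "mydata.csv" is a placeholder name.
* The clear option replaces any data currently held in memory.
insheet using "mydata.csv", clear
```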

7.2. Merging

You can merge your data in either Stata or Microsoft Excel. Units of analysis are naturally expected to be identical in both datasets.
It is essential that your observations match identically when merging files: for instance, when merging two datasets with country-level data, you will have to make sure
that the countries are present in both datasets under identical names.
Merging Stata datasets uses the Stata merge command, a very powerful tool for merging and matching your data. The command is very well documented in this handy tutorial by Roy Mill:
http://stataproject.blogspot.com/2007/12/combine-multiple-datasets-into-one.html
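As a hedged sketch of the syntax in recent Stata versions (the filenames and the country identifier below are hypothetical), merging one country-level file into another looks like this:

```stata
* Hypothetical example: both files must contain the cname identifier,
* with countries spelled identically in each dataset.
use "countries1.dta", clear
merge 1:1 cname using "countries2.dta"

* The _merge variable created by the command shows how many
* observations matched across the two files; always inspect it.
tab _merge
```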


7.3. Reshaping

Your dataset should hold your units of observation in rows, and your variables in columns. If that format is not respected, you will need to reshape your dataset in order to
fit that format.
In the example used here, the data are formatted with year values in columns, while the units of observation are displayed in rows. This format often applies to time series of country-level data; it applies, for instance, to OECD data provided for Microsoft Excel. Solving this issue requires running a series of steps called "reshaping".
To reshape data for one variable, follow these steps carefully:

−  Start by making sure that your data have been properly prepared: all variables must be numeric, and missing observations should be encoded as such.

−  Prepare your data as a CSV file. All variables should be labelled on the first line, and the rest of the file should contain only data (remove any other text or information).

−  Add a letter in front of each year. Select your first line, which contains the variable names, and then use the 'Edit > Replace…' menu item in Excel to add a 'y' in front of each year. For instance, if your data were collected for years 1960–2010, find '19' and replace with 'y19', then find '20' and replace with 'y20'.

−  Import the file with the insheet command, and check your data in the Stata Data Editor. The example below shows the result with only one variable (health expenditure per capita in some OECD countries).

−  Create a unique id value for each unit of observation (in this case, OECD countries) by typing gen id = _n and then order id. This will add an additional variable to your data.

−  To reshape your data, type reshape long y, i(id) j(year) to reshape the data columns that start with a 'y', for all rows identified by the id variable, into a different data format, called "long", where the years will have been fit into a year variable.
. reshape long y, i(id) j(year)
(note: j = 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                       16   ->   256
Number of variables                  18   ->   4
j variable (16 values)                    ->   year
xij variables:
                    y1960 y1961 ... y1975  ->   y
-----------------------------------------------------------------------------

Your dataset will have been converted from its initial “wide” format, with values for
each year in columns, to a “long” format where the values for each year appear on
separate rows.
Once you are in “long” mode, you can rename the variable that you were working on
and drop the observations that you do not need (remember that you are working to
obtain cross-sectional data and not time series).
ren y hexp
la var hexp "Health expenditure per capita"
drop if year != 1975


The screenshot on the left shows your data in “long” mode; the screenshot on the right
shows the same data after executing the commands described above.
If you are trying to reshape a dataset that is formatted in "wide" mode with more than one variable, more steps are required to separate the variables, as described in this tutorial: http://dss.princeton.edu/training/DataPrep101.pdf (locate the "Reshape" slides #3–5).
Note that data reshaping will take a lot of time if you are merging and reshaping data over a large number of files. Do not try to merge more than a handful of datasets, as anything more would require more time than this course can reasonably ask of you.
Additional options of the reshape command allow you to reshape data where the suffix is not numeric: type help reshape (or the shorthand version, h reshape) for additional details.

7.4. Subsetting

Subsetting your data is a way to analyse only a selected subsample of the data. This can happen principally for two reasons:

−  This course focuses on cross-sectional data and therefore requires that you subset only one time period if your dataset spans more than one period of survey data.

−  If you want to analyse only one segment of the data, such as a single country in a Europe-wide dataset or one age group in a population-wide dataset, you should subset the data to it.

Application 7a. Subsetting to cross-sectional data    Data: NHIS 2009

In order to calculate and analyse the Body Mass Index (BMI) of American respondents, we use recent data from the National Health Interview Series (study: NHIS). We examine the structure of the dataset by inspecting the year variable with the fre command (with trivial formatting options):

. fre year, rows(5) nol

year
----------------------------------------------------
                Freq.    Percent    Valid      Cum.
----------------------------------------------------
Valid  2000     28712      11.41    11.41     11.41
       2001     29459      11.71    11.71     23.12
       :            :          :        :         :
       2008     18913       7.52     7.52     90.34
       2009     24291       9.66     9.66    100.00
       Total   251589     100.00   100.00
----------------------------------------------------

At that stage, we need to select which year we want to work on. An intuitive choice is
the most recent year, if it holds a sufficient number of observations. In this example,
survey year 2009 forms a large subsample of observations, though not the largest.
Given that our dependent variable, the Body Mass Index, will require the height and
weight of each respondent to be calculated, we verify the total number of observations
for the height and weight variables among respondents who were interviewed during
survey year 2009:
. su height weight if year==2009

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      height |     24291    66.61652    3.865753         59         76
      weight |     24291    172.5895    37.12779        100        285

The results indicate that survey year 2009 holds a sufficiently large number of observations, and that the variables of interest are not missing for that year. We thus subset
the dataset to that year with the keep command and the if operator set to select observations where the year is equal (“==”) to 2009:
. keep if year==2009
(227298 observations deleted)

We could also have used the drop command to remove all years that are different ("!=") from 2009, although that formulation is somewhat less intuitive:
. drop if year!=2009
(227298 observations deleted)

An additional check of the year variable shows that subsetting was successful:

. fre year, nol

year
---------------------------------------------------
               Freq.    Percent    Valid      Cum.
---------------------------------------------------
Valid  2009    24291     100.00   100.00    100.00
---------------------------------------------------
In this example, the total number of observations which we study in our analysis is thus N = 24,291, rather than 251,589 for all survey years. The keep and drop commands also apply to variables, and we could continue here by subsetting the dataset to only a handful of variables which we plan to use in the analysis, but this course will not require you to do so.
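For the record, variable-level subsetting follows the same syntax with a variable list instead of a condition. A minimal sketch, with an illustrative selection of NHIS variables:

```stata
* Keep only the variables needed for the analysis;
* all other variables are deleted from memory.
keep year sex age height weight
```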


8. Variables
The basic anatomy of a variable consists of its name and values, to which we can add
labels in order to provide short descriptions of what the variable measures. Data, and
categorical data especially, are rarely understandable without labels.
Your primary source of information for variable codes is always the dataset codebook,
but for practical reasons, some of that information is also stored in your dataset, as you
might have to modify it before running your statistical analysis.
This chapter shows:

−  How to inspect variables (Section 8.1). Variable inspection is necessary to learn how the variable is coded, in order to select appropriate commands for its manipulation and analysis.

−  How to show and set variable labels (Section 8.2). Labels are short text descriptions attached to your variables and to their numeric values. They increase the readability of your data and results.

−  How to recode variables into different categories (Section 8.3). Recoding allows you to select the categories in which to manipulate your variables, which is helpful when you want to analyse particular groups of observations.

−  How to solve encoding issues (Section 8.4). Encoding applies to missing data, which should be coded as ".", and to variables that hold text, i.e. "strings", which are better manipulated when they are encoded with numeric values.

8.1. Inspection

The following commands allow inspecting the names, values and labels of variables:

−  codebook is the most exhaustive command if you need to understand your data structure in depth.
   Stata also offers a note function that makes it possible to write an annotated codebook within a Stata dataset, but this function is idiosyncratic to the DTA format and limits interoperability.

−  describe (shorthand d) is generally used to describe several variables at once, as when opening a dataset for the first time. It provides three kinds of information:

   −  The variable name is how the variable is named in your dataset. It is the name that you pass to Stata through your commands.

   −  The variable label is a short text description of the variable (gender). It usually includes the unit of measurement used by the variable when relevant.

   −  The value label is the name of a distinct element of data structure that assigns text labels to the numeric values of the variable. It will often be the case that value labels have the same name as your variable.

−  label list (shorthand la li) is one of many label commands to show and edit the labels featured in a dataset.

Application 8a. Inspecting a categorical variable    Data: NHIS 2009

The example below shows the encoding of a variable that codes for respondents' sex:

. d sex

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------
sex             byte   %8.0g       sex_lbl    Sex

To understand how the male and female respondents are coded by the sex variable,
use the label list command (shorthand la li) to display the value label sex_lbl:
. la li sex_lbl
sex_lbl:
1 Male
2 Female

The sex_lbl value label is a separate entity from the sex variable itself: it can be applied
to any other variable where it is suitable to have males coded as 1 and females as 2, as
with variables that code for the gender of other persons in the respondent’s household.
When you need to access all information above in a single command, the codebook
command provides detailed output on names, values and labels, as well as more details
on missing data:
. codebook sex

------------------------------------------------------------------------------
sex                                                                        Sex
------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  sex_lbl

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/24291

            tabulation:  Freq.   Numeric  Label
                         10978         1  Male
                         13313         2  Female

We later explore a way to code this variable as a dummy, which is actually smarter. On its own, the mean value of the sex variable is unreadable, but if we code sex as 0 for males and 1 for females and make that coding explicit by naming the variable female, then its mean directly shows the proportion of women in the sample.
Application 8b. Inspecting a continuous variable    Data: NHIS 2009

For continuous data, the d and codebook commands show the same information as for categorical data. The codebook command used with the compact (shorthand c) option displays the variable name and label along with its total number of observations, mean and range. The example below shows its results for age, height, weight and health:
. codebook age height weight health, c

Variable      Obs Unique      Mean  Min  Max  Label
-------------------------------------------------------------------------------
age         24291     67  46.81392   18   84  Age
height      24291     18  66.61652   59   76  Height in inches without shoes
weight      24291    186  172.5895  100  285  Weight in pounds without clothes...
health      24284      5  2.288709    1    5  Health status
-------------------------------------------------------------------------------

This table presents descriptive statistics in an efficient way that resembles the table of summary statistics that we will produce at the end of Section 9. Note, however, that the health variable is not a truly continuous variable but an ordinal one that codes the self-reported health of respondents on a subjective scale from 1 (excellent) to 5 (poor).

8.2. Labels

Section 6.2 already introduced the rename (shorthand ren) command to modify variable names. We now turn to modifying variable and value labels:

−  All your variables should be assigned at least one label, the variable label, which is already set in most datasets.
   When creating variables, label them with the label variable (shorthand la var) command. Include, if applicable, their unit of measurement.

−  A second form of label then applies to the values of categorical data, as when 1 codes for "Strongly agree", 2 codes for "Agree" and so on.
   These labels are modifiable with the label define (shorthand la def) and label values (shorthand la val) commands.

Application 8c. Labelling a dummy variable    Data: NHIS 2009

In this example, we create a variable for the respondents' Body Mass Index (BMI) and examine it under three different forms. We first create the variable from the respondents' weight and height measurements, respectively in pounds and inches:

. gen bmi=weight*703/height^2
. su bmi

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         bmi |     24291       27.27    5.134197   15.20329   50.48837

Immediately after creating the variable and checking its results, we label the variable
with the signification of the ‘BMI’ acronym to help ourselves and others make sense of
the data at later stages of analysis:
. la var bmi "Body Mass Index"
. d bmi

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------
bmi             float  %9.0g                  Body Mass Index

Given that Body Mass Index is a continuous measurement that comes in its own metric,
the bmi variable does not require any additional label. Let’s assume, however, that we
are further interested in identifying respondents with a BMI of 30+, which designates
obesity in the WHO classification of BMI. To that end, we create a dummy variable for
respondents over that threshold:
. * Dummy for obesity.
. gen obese:obese = (bmi >= 30) if !mi(bmi)

The gen command created the obese variable and assigned the obese value label to it.
The logical test (bmi >= 30) returned 1 when that statement was true, 0 if false. Observations for which the bmi variable was missing were excluded from the operation and
therefore preserved as missing.
The result is a dichotomous variable where 1 codes for obesity and 0 otherwise. When
we summarize the obese dummy as a continuous variable with the su command, its
mean provides the percentage of obese respondents in the sample:
. su obese

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       obese |     24291    .2626076      .44006          0          1

In the next commands, we label the obese variable and define the obese value label in
reference to its two possible values (1 if obese, 0 if not):
. la var obese "Obesity (BMI 30+)"
. la def obese 0 "Not obese" 1 "Obese"

The variable and value labels show in the categorical display of the dummy through the fre command. The variable is fully specified:

. fre obese

obese    Obesity (BMI 30+)
------------------------------------------------------------
                      Freq.    Percent    Valid      Cum.
------------------------------------------------------------
Valid  0 Not obese    17912      73.74    73.74     73.74
       1 Obese         6379      26.26    26.26    100.00
       Total          24291     100.00   100.00
------------------------------------------------------------

The obese value label can also be assigned to other variables with the la val command,
if you are interested in coding for obesity in other persons than the respondent.

8.3. Recoding

Recoding is a way of producing a new variable out of an existing one, by collapsing values of the original variable into different categories. Most of the variables in your dataset are probably ready for use in their original metric, but in some cases, you might want to recode a variable using one of the following techniques:

−  The recode command is very handy to create groups from continuous data or to permute values in categorical data, as shown in Application 8d. It can also create dummies, as shown in Application 8e along with the tab, gen() equivalent.

−  The gen command has three extensions – recode(), irecode() and autocode() – that produce essentially the same result as the recode command in less code and with a bit more flexibility, as shown in Application 8f.

−  The replace command can be used to 'hard-recode' variables, but you would be altering your original variables in doing so, and therefore running an additional risk of data-related error. We use the replace command for missing data only.

When you are creating groups from continuous variables, make sure that your categories do not omit any values of the original variable (exhaustiveness), and that they do not overlap (mutual exclusiveness). For example, if you plan to recode educational attainment, make sure that each diploma appears in only one category, and that all levels of education are represented in your new categories.
The number of categories is a substantive issue that depends on your variable. Aggregation can take the form of birth cohorts, age groups or income bands. The only firm rules that apply are exhaustiveness and mutual exclusiveness (recode each value of the original variable to a single category of the new variable).
Application 8d. Recoding continuous data to groups    Data: NHIS 2009

The age variable, which measures the age of respondents, can be used in its continuous form or recoded into age groups for crosstabulations. To recode the age variable into four age groups, we use the recode command and create the age4 variable with the gen option, keeping all names and labels as concise and explicit as possible:
. * Recoding age to 4 groups.
. recode age ///
>     (18/29=1 "18-29") ///
>     (30/44=2 "30-44") ///
>     (45/64=3 "45-64") ///
>     (65/max=4 "65+"), gen(age4)
(24291 differences between age and age4)

. la var age4 "Age (4 groups)"

Never operate a transformation like the one above without checking its results. The
whole point of programming your analysis into a do-file is that you can include comments and checks throughout your work. Here, the fre command serves as a technical
verification for the operation:
. fre age4

age4    Age (4 groups)
--------------------------------------------------------
                  Freq.    Percent    Valid      Cum.
--------------------------------------------------------
Valid  1 18-29     4744      19.53    19.53     19.53
       2 30-44     6715      27.64    27.64     47.17
       3 45-64     8477      34.90    34.90     82.07
       4 65+       4355      17.93    17.93    100.00
       Total      24291     100.00   100.00
--------------------------------------------------------

The number of age groups ultimately depends on your research design. A more fine-grained categorisation might apply if your hypotheses predict strong generational or cohort effects, or rely on specific positions in the life cycle, such as being part of the generation that was young and politically active around 1968 in Western countries.
Application 8e. Recoding dummies    Data: NHIS 2009

In Application 8a, we saw that the sex variable codes 1 for males and 2 for females in our dataset. However, we prefer to manipulate dichotomous measures in the form of dummy variables that use sensible values of 0 and 1 in relation to the variable name.
To that end, we recode the sex variable into the female dummy, where 1 naturally codes for being female and 0 for not being female, i.e. male. We could use the gen command as we did in Application 8c, but the recode command is just as efficient here:
. * Recoding sex as a female dummy.
. recode sex (1=0 "Male") (2=1 "Female") (else=.), gen(female)
(24291 differences between sex and female)

You should check for exact concordance at that point. Crosstabulating the original and
recoded variables will work, but a quicker concordance test exists here:
. count if female != sex-1
0

Dummies are very common in statistical modelling, and Stata offers more ways to code
information into dummies. The tab command, for instance, can create dummies for
each of its categories with the gen option, as here with marital status:
. * Recoding marital status as dummies.
. tab marstat, gen(married)

  Legal marital status |      Freq.     Percent        Cum.
------------------------+----------------------------------
               Married |     11,221       46.19       46.19
               Widowed |      1,874        7.71       53.91
              Divorced |      3,696       15.22       69.12
             Separated |        906        3.73       72.85
         Never married |      6,542       26.93       99.79
Unknown marital status |         52        0.21      100.00
------------------------+----------------------------------
                 Total |     24,291      100.00

In this example, the marstat variable has been used to create six dummies, all named with the married prefix, and each coding for one category of marital status. The dummies are given descriptive variable labels, as shown when inspecting all married1, married2, … married6 dummies at once with the * wildcard:
. codebook married*, c

Variable      Obs Unique      Mean  Min  Max  Label
-------------------------------------------------------------------------------
married1    24291      2  .4619406    0    1  marstat==Married
married2    24291      2  .0771479    0    1  marstat==Widowed
married3    24291      2  .1521551    0    1  marstat==Divorced
married4    24291      2  .0372978    0    1  marstat==Separated
married5    24291      2  .2693179    0    1  marstat==Never married
married6    24291      2  .0021407    0    1  marstat==Unknown marital status
-------------------------------------------------------------------------------

The codebook command with the c option shows the mean value of each dummy, which is also the frequency of its category in the data. The dummy for being divorced, married3, hence has a mean of .15 and represents 15% of all observations.

Application 8f. Recoding bands    Data: NHIS 2009

If you need to produce more complex recodes, the recode(), irecode() and autocode() extensions of the gen command produce results similar to recode in less code and in a more flexible way that is particularly appropriate for recoding continuous data into bands, such as age classes or income bands.
The following example uses irecode() to recode the Body Mass Index into the four groups established by its international classification:
. * Recoding BMI to 4 groups.
. gen bmi4:bmi4 = irecode(bmi, 0, 18.5, 25, 30, .)

This command creates a first category of respondents for which 0 ≤ BMI < 18.5, which is classified as underweight, up to a fourth category for 30 ≤ BMI < ∞, which designates obese respondents. We add variable and value labels to specify the recoded variable, and finally check the recoded variable with the table command:
. la def bmi4 1 "Underweight" 2 "Normal" 3 "Overweight" 4 "Obese"
. la var bmi4 "BMI classes"
. table bmi4, c(freq min bmi max bmi) f(%9.4g)

----------------------------------------------
BMI classes     Freq.    min(bmi)    max(bmi)
----------------------------------------------
Underweight       274        15.2       18.48
Normal           8625       18.51          25
Overweight       9013       25.01       29.99
Obese            6379       30.02       50.49
----------------------------------------------

The table command is used here as a technical check comparing the categories of the
bmi4 variable to the minimum and maximum values of BMI that they respectively hold.
The format (shorthand f) option limits the number of visible floating digits.

8.4. Encoding

Encoding issues have to do with the format of your data. Fundamentally, your dataset is just a text file with a text encoding and delimiters for columns and rows. Within your dataset, two encoding issues commonly arise:

−  Missing data are often encoded as arbitrary numeric values. These values can be distinctive, such as -1 for strictly positive data or 9 for ordinal data on a five-point scale. In other cases, multiple codes are used, as in 77 for "Refused to answer", 88 for "Do not know" (DNK) and 99 for "No answer" (NA).
   Stata requires missing data to be encoded as a dot (.). It also supports multiple missing data formats: .a, .b, …, .z can be used to encode missing data of different kinds. Stata treats all missing data as +∞ (positive infinity).
   Missing data that are not yet coded in Stata format can be addressed with the replace command. When encoding several variables with identical coding schemes, the mvdecode command can perform batch encodings.

−  Textual data are often encoded as chains of characters. These "strings" of text, as they are called in programming environments, are difficult to manipulate from Stata because they do not come with a numeric framework.
   Stata requires strings to be encoded with numeric values. Only in specific circumstances is encoding text neither necessary nor particularly desirable, as when manipulating singular information like country names.
   Text data that are not yet supported by numeric values can be addressed with the encode command; the specific case of numbers stored as text is handled by the destring command.

Encoding missing data    Data: NHIS 2009

The following example shows a typical encoding issue. If analysed in its current state,
the diayrsago variable will treat values 96, 97 and 99 as valid measurements, completely
distorting any analysis of the variable:
. fre diayrsago, row(10)
diayrsago    Years since first diagnosed with diabetes

                                    Freq.   Percent     Valid      Cum.
Valid   0    Within past year          86      0.35      0.35      0.35
        1    1 year                   151      0.62      0.62      0.98
        2    2 years                  163      0.67      0.67      1.65
        3    3 years                  164      0.68      0.68      2.32
        4    4 years                  117      0.48      0.48      2.80
        :    :                          :         :         :         :
        81   81 years                   1      0.00      0.00      8.85
        82   82 years                   1      0.00      0.00      8.85
        96   NIU                    22111     91.03     91.03     99.88
        97   Unknown-refused            2      0.01      0.01     99.88
        99   Unknown-don't know        28      0.12      0.12    100.00
             Total                  24291    100.00    100.00

To solve this issue, we need to replace values 96, 97 and 99 with missing data codes
that are recognisable by Stata, i.e. either just “.” for all values, or .a, .b and .c if we
are interested in keeping them distinct from each other. The precise choice depends
entirely on our research design.
Assuming that we want to fix the issue in the simplest way, two solutions apply. The
first solution modifies the diayrsago variable directly, using the replace command to
substitute values over 95 with missing data:
. * Simple encoding.
. replace diayrsago=. if diayrsago > 95
(22141 real changes made, 22141 to missing)

The alternative code uses the gen command with the cond() operator to create a new
variable through a simple “if… else” statement. The diabetes variable will be equal to
the original diayrsago variable except when the latter is 95 or above, in which case it will be set to missing:
. * Alternative.
. gen diabetes = cond(diayrsago < 95, diayrsago, .)
(22141 missing values generated)

Both solutions are almost equivalent, and users might generally prefer the first one for
its simplicity. The second is actually more secure, since it does not overwrite the original
variable; however, it creates a new variable and therefore loses the labels.
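As a minimal sketch of how to restore the lost metadata, the variable label can be re-attached by hand, re-using the label text shown in the fre output above:

```stata
* Re-attach a variable label to the newly created diabetes variable.
la var diabetes "Years since first diagnosed with diabetes"
```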
Assuming that we want to preserve a distinction between types of missing data, two
other solutions apply. The first one proceeds as before with the replace command, but
uses the .a, .b and .c missing data markers:
. * Detailed encoding.
. replace diayrsago=.a if diayrsago == 96
(22111 real changes made, 22111 to missing)
. replace diayrsago=.b if diayrsago == 97
(2 real changes made, 2 to missing)
. replace diayrsago=.c if diayrsago == 99
(28 real changes made, 28 to missing)

The alternative code is, again, more secure and this time also much quicker. It uses the
recode command to create the diabetes variable, recoding values to missing data in the
process while leaving untouched all other values by default:
. * Alternative.
. recode diayrsago (96=.a) (97=.b) (99=.c), gen(diabetes)
(22141 differences between diayrsago and diabetes)

Finally, let’s introduce a case where multiple variables are using the same scheme for
missing data, as with the ybarcare and uninsured variables below:

. fre ybarcare uninsured
ybarcare    Needed but couldn't afford medical care, past 12 months

                                    Freq.   Percent     Valid      Cum.
Valid   1    No                     21811     89.79     89.79     89.79
        2    Yes                     2477     10.20     10.20     99.99
        9    Unknown-don't know         3      0.01      0.01    100.00
             Total                  24291    100.00    100.00

uninsured    Health Insurance coverage status

                                    Freq.   Percent     Valid      Cum.
Valid   1    Not covered             4510     18.57     18.57     18.57
        2    Covered                19727     81.21     81.21     99.78
        9    Unknown-don't know        54      0.22      0.22    100.00
             Total                  24291    100.00    100.00

In this case, the mvencode and mvdecode commands are quicker than the alternatives. Correctly
encoding the missing data requires using the mvdecode command on both variables, passing the values to be encoded as missing through the mv option:
. * Batch encoding.
. mvdecode ybarcare uninsured, mv(9)
ybarcare: 3 missing values generated
uninsured: 54 missing values generated

Data structures can differ markedly, and encoding issues will frequently arise as soon as
you start opening datasets created in software other than Stata. Different encodings for
missing data can be solved quickly, but only if diagnosed: always spend enough time
checking your variables for unusual codes before analysing them.
Application 8g. Encoding strings    Data: MFSS 2006

In this example, we look at the Music File Sharing Study, which the Canadian government contracted in 2006 to study how digital content affects consumer behaviour. The
survey was documented and analysed in a paper by Birgitte Andersen and Marion
Frenz (Journal of Evolutionary Economics, 2010), and is available from Industry Canada.
The dataset was imported into Stata using the insheet command, but substantial encoding issues plague the data at that stage. These issues can be diagnosed by inspecting the storage type of each variable, but they are more readily apparent when browsing
data from the Data Editor, which you can open with the browse command:
. browse id quest s_dat prov qregn qd8 q2_1a in 1149/1159

66

!

In this screenshot, variables with values in red are simply coded as text with no numeric
value to designate them—a format also known as ‘string’, which is impractical for statistical analysis, as hinted by the warning colour that Stata assigns to their columns.
Let’s start with the prov variable, which codes for the respondent’s province of residence. Because of the string format, we have to include double quotes around its values to designate respondents from, for example, Alberta and British Columbia:
. count if prov=="AB" | prov=="BC"
293

This quickly becomes impractical, so we use the encode command to produce a similar
variable with automatically generated numeric values and labels for each of them:
. encode prov, gen(province)
. fre province
province    PROV

                    Freq.   Percent     Valid      Cum.
Valid   1    AB       152      7.24      7.24      7.24
        2    BC       141      6.71      6.71     13.95
        3    MB        57      2.71      2.71     16.67
        4    NB        54      2.57      2.57     19.24
        5    NL        23      1.10      1.10     20.33
        6    NS        48      2.29      2.29     22.62
        7    ON       559     26.62     26.62     49.24
        8    PE         6      0.29      0.29     49.52
        9    QC      1006     47.90     47.90     97.43
        10   SK        54      2.57      2.57    100.00
             Total   2100    100.00    100.00

The numeric encoding makes it possible to select respondents in Alberta or British Columbia with shorter and more flexible commands, using the values assigned by the
encode command to each category:
. count if province < 3
293

Similarly, the qd8 variable coding for gender cannot be easily manipulated in its current
form: Stata needs numeric values attached to each of its categories in order to include it
in a regression model, for example.
A simple solution consists in creating a dummy coding for females, as we previously did
in Application 8e. The data is in string format, so we need to use double quotes and
text instead of numeric values to create the appropriate conditional statement:
. * Creating a female dummy from string values.
. gen female = (qd8 == "Female") if !mi(qd8)

The encode command would produce a similar result, but dummy variables with explicit
names and codes need not feature labels, so we will settle for that simple solution.
The last example concerns the q2_1a variable, which measures the number of music CDs
that the respondent bought in 2005 for his or her personal use. The variable is stored as
a string because it includes both numbers and text, as well as empty cells. In that state,
the variable is virtually unusable, so we apply several transformations to it:
. replace q2_1a=".a" if q2_1a==""
. replace q2_1a=".b" if q2_1a=="None"
. replace q2_1a=".c" if q2_1a=="Don't Know/Refused"
. destring q2_1a, replace
q2_1a has all characters numeric; replaced as byte
(493 missing values generated)

The first three commands code for missing data where the q2_1a variable featured text
or empty cells. Since the q2_1a variable is based on text, the arguments of the replace
commands feature double quotes. Once the variable contained only numeric or missing
values, we got rid of the string format with the destring command.
Finally, the numlabel command is a handy workaround for serious encoding issues: try
numlabel _all, add to prefix all textual labels with their numeric values. This makes the tab
command more practical (a problem solved in this handbook by using the fre command
instead).

*
Data management, as shown by the topics covered in Sections 5–8, is not only long, it
is also complex in its more elaborate stages, and very sensitive to even the smallest mistake. We will circumvent that issue by using readymade datasets for which data management will be reduced to a minimum, but in a real research environment, quantitative data skills will often extend to data management, in addition to the other skills introduced in Section 4.


Part 2

Analysis
Statistical analysis requires learning about the statistical theory that underlies the analysis, as well as about the particular procedures that run the analysis in Stata.
This step is the most knowledge-intensive aspect of the course, as it requires operating several commands while knowing what they correspond to in theory, and how to
interpret their results.
Statistical analysis is a professional, scientific activity. Making mistakes while analysing quantitative data is common, and several rounds of analysis are usually required to
obtain reliable results. In practice, it requires the collective effort of scientists worldwide
to work on large projects and to verify that their respective research does not contain
errors of interpretation.
This class emulates that professional setting by organising small-scale research projects
that are then submitted to the scrutiny of the course instructors. The tools used in the
analysis will be restricted to a selection of statistical tests, and to the most common
form of statistical modelling, linear regression.
At that stage, it is essential that you are familiar with working in Stata, and that you
have read the handbook chapters that document the most basic aspects of data structure, such as sampling. The following section introduces several tests, procedures and
models that will connect your data to particular interpretations, all of which you will be
learning as you perform them.


9. Distributions
Assessing the normality of your dependent variable is essential to your analysis because
the regression model assumes that this variable is normally distributed. This assumption,
and many others that apply to regression modelling, are systematically violated, because the normal distribution is a theoretical construct.
At that stage, you should make sure that you have understood the Central Limit Theorem by reading from your handbook. Try plotting the normal distribution in Stata by
typing twoway function y=normalden(x), range(-4 4), and check that you understand
how standard deviations relate to this curve. As a side note, the course will also mention other (Poisson, binomial) distributions, but you will not be working directly with
either of them.
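Building on the command just mentioned, reference lines can be added to connect the curve to the standard deviations; a sketch, with illustrative xline() values:

```stata
* Standard normal density with vertical lines at ±1 and ±2 standard deviations.
twoway function y=normalden(x), range(-4 4) xline(-2 -1 1 2) ///
	xtitle("Standard deviations from the mean")
```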
Thankfully, linear regression is quite robust to deviations from normality in your dependent variable. This basically means that your analysis retains most of its validity even
if your variables express some departure from normality. Still, you should aim at having
as normally distributed a dependent variable as possible.
Assessing normality is a two-step process that starts with visual inspections of the distribution, and then continues with formal tests of normality. Following that step, you
might try different variable transformations to see whether there exists a mathematical
way to make the distribution of your dependent variable approach normality. Finally,
you will have to think about outliers (outstanding observations in your data).
These operations are absolutely essential to your analysis, because quantitative analysis
does not magically proceed by throwing aggregate data at a statistical software solution. Instead, it relies on careful modelling that aims at fitting real-world observations
into abstract models. The ‘goodness of fit’ of your model will determine the quality of
your results.

9.1. Visualizations

Prior to visualizing a variable, you can learn about it with descriptive statistics. These
steps come after reading the dataset documentation, and thus presume that you have
an idea of how the variable is coded and how many observations are available for it,
even if the commands listed below also inspect these:
−

For continuous data, the summarize command (shorthand su) provides the
number of observations for a variable, as well as its mean, standard deviation,
minimum and maximum values.
The summarize command with the detail option will add percentiles and variance, as well as skewness and kurtosis, which we will return to when assessing the
normality of the distribution (Section 9.3).
For more specific operations, the tabstat command is a more flexible tool that
will provide any statistics passed with its s() option, including any of the above
as well as all possible statistics listed in its documentation.
−

For categorical data, the fre command will give you the best approach to the
variable by listing its frequencies while paying attention to its coding and missing values. Install it with ssc install fre.
Using the standard Stata commands, you would have to use the tab command
with the missing and nolabel options, along with the label list command, to obtain similar results.
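As a sketch with a hypothetical categorical variable v, the standard-command equivalent of fre would look like this:

```stata
* Frequency table including missing values, showing raw codes instead of labels.
tab v, missing nolabel
* List the value labels defined in the dataset to match codes to label text.
label list
```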

Once you have learnt enough about your variable through descriptive statistics, you
should turn to visualizations:
−

With continuous variables, you will be using the histogram as your main tool,
complemented with the box plot in order to spot outliers (Section 9.5). Useful
central tendency descriptive statistics will be the mean and median.

−

With categorical variables, you will be using the categorical bar plot, although
its value added to a simple frequency table is often open to question. A useful
descriptive central tendency statistic will be the mode.

The examples below illustrate these options. We will alternate between several of the
course datasets to show how descriptive statistics and visualization work with different types of data.
Important: observations in each example are not systematically weighted, so as to keep
the code as simple and demonstrative as possible, but you should apply weights when
your dataset offers them. Weighting, as shown in Example 5a and Example 5b, modifies
the statistical importance of observations within the sample in relation to how it was
initially designed. Sample size might also affect your description of the data, as shown
with confidence intervals in Example 9d.
Example 9a. Visualizing continuous data    Data: NHIS 2009

The dependent variable in this example is the Body Mass Index (BMI) for a large sample
of American respondents in 2009, calculated from measures of weight in pounds and
squared height in inches, labelled, and then described, using the following commands:
. gen bmi = weight*703/(height^2)
. la var bmi "Body Mass Index"
. su bmi
    Variable         Obs        Mean   Std. Dev.        Min        Max

         bmi       24291       27.27    5.134197   15.20329   50.48837
The interpretation of the statistics above covers several points:

− First, the total number of observations (Obs) is satisfactory: the bmi variable is available for a large fraction of the data—actually, for 100% of observations in the dataset.

− Then, the average Body Mass Index (Mean) in our sample is remarkably high—as a BMI of 27 is already considered “overweight” in the official categorization of the BMI measure.

− The standard deviation (Std. Dev.) further qualifies the distribution by giving its spread: for instance, roughly 95% of all observations fall within two standard deviations of the mean, i.e. between 17 and 37.5.

− Finally, the range of values (Min and Max) indicates that the distribution is skewed, since the maximum value of 50 is further away from the mean than the minimum value of 15. A box plot will confirm this.

We then turn to visualizations, using appropriate graphs for continuous data:
. hist bmi, normal percent
(bin=43, start=15.203287, width=.82058321)
. gr hbox bmi

The histogram (shorthand hist) command was passed two options: normal, which
overlays a normal curve on the histogram bars, and percent, which uses percentages
instead of density on the vertical y-axis. The graph hbox command comes in two distinct words because it belongs to the graph (shorthand gr) class of commands, which
can be passed options to modify axes, titles and so on.

From the graphical results of these commands, we observe that the bmi variable is not
normally distributed, due to a disproportionate amount of right-hand-side values that
form a long ‘right tail’ in the histogram, and outliers in the box plot:

[Histogram of Body Mass Index with normal overlay (percent scale, 0–10), and horizontal box plot of Body Mass Index (range 10–50)]

The histogram shows a distribution that is skewed to the right, and the box plot shows
that BMI values over 40 are outliers to the distribution, located over 1.5 times the interquartile range (Section 9.5 deals with outliers in more detail).

A precise look at the BMI variable would also reveal that its mean and median are quite
close, indicating some extent of symmetry in the distribution despite the skewness
mentioned before. We will come back to these notions.
Example 9b. Kernel density plots    Data: NHIS 2009

Histograms use bars (or “bins”) to represent a distribution. A different tool to visualize
a distribution is the kernel density plot, which also displays the density of the distribution, but uses smoothed lines instead of bars.
The left-side example below shows the commands to produce a histogram and a kernel
density plot for the distribution of Body Mass Index. The options set the width of the
histogram bins and of the kernel density, along with other options (see help histogram):
. hist bmi, w(2) normal kdens kdenopts(bw(2) lc(red))
(bin=18, start=15.203287, width=2)

[Left: histogram of Body Mass Index with normal and kernel density overlays. Right: kdensity plot of Body Mass Index with normal density overlay; kernel = epanechnikov, bandwidth = 0.5945]

The graph on the right shows a quicker way to draw a kernel density with the kdensity
command and the normal option. In both graphs, the skewness observed in the kernel
density curve also shows in a comparison of the mean and median values:
. tabstat bmi, s(n mean median skewness)
    variable           N        mean         p50    skewness

         bmi       24291       27.27    26.57845    .7207431
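The right-hand graph can be drawn directly with the kdensity command mentioned above; a minimal sketch:

```stata
* Kernel density estimate of BMI with a normal density overlay for comparison.
kdensity bmi, normal
```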

Example 9c. Visualizing categorical data    Data: ESS 2008

If you are inspecting a categorical variable, you will realise that the tools described
above are either inappropriate or of very little help, since summary statistics make
little sense for categories. Instead, you will look at proportions, and you
will need to install the additional fre and catplot packages.
The fre command is particularly useful to handle missing observations. In the example
below, we look at attitudes towards immigration to Europe from poorer countries. We can tell from
the distribution of the variable that only 4.5% of observations are missing, and can also read the
percentages of each response item to the question:
. fre impcntr
impcntr    Allow many/few immigrants from poorer countries outside Europe

                                           Freq.   Percent     Valid      Cum.
Valid    1    Allow many to come and
              live here                     5750     11.24     11.78     11.78
         2    Allow some                   16496     32.26     33.79     45.56
         3    Allow a few                  17054     33.35     34.93     80.49
         4    Allow none                    9523     18.62     19.51    100.00
              Total                        48823     95.47    100.00
Missing  .a                                  119      0.23
         .b                                 2144      4.19
         .c                                   56      0.11
              Total                         2319      4.53
Total                                      51142    100.00

When visualizing categorical data, follow two recommendations:
−

Do not use pie charts. The human eye is not used to reading polar coordinates,
which makes the vast majority of pie charts useless at best, deceitful at worst.

−

Produce a horizontal bar plot of the valid cases with the catplot command, but
ask yourself whether the graph brings any substantial information to the reader.
The answer is most likely negative.

The plots below show the impcntr variable as a histogram and as a categorical bar plot,
but neither visualization brings much more than a frequency table:
. hist impcntr, percent discrete addl
(start=1, width=1)
. catplot impcntr, percent blabel(bar, format(%3.1f)) yti("")

[Left: histogram of impcntr with percentage scale and bar labels. Right: categorical bar plot—Allow many to come and live here: 11.8; Allow some: 33.8; Allow a few: 34.9; Allow none: 19.5]

Frequency tables like the ones produced by the fre command can be formatted to fit
into tables with other descriptive statistics (Section 13.4).

Example 9d. Survey weights and confidence intervals    Data: ESS 2008

A drawback of plotting distributions without first taking a look at the underlying data
structure is that the resulting plots can hide large confidence intervals. Differences in
proportions that are based on a low number of observations come with large confidence intervals that might minimise – or even cancel – the visual differences that we
might observe on a graph.
In the example below, the survey question from Example 9c is analysed for French
adult citizens only (study: ESS, variable: impcntr, with additional variables to select the
target group made of French adult citizens). The number of valid observations for this
target group is markedly lower than previously, with only 1884 non-missing observations:
. fre impcntr if cntry=="FR" & age >= 18 & ctzcntr==1
impcntr    Allow many/few immigrants from poorer countries outside Europe

                                           Freq.   Percent     Valid      Cum.
Valid    1    Allow many to come and
              live here                      151      7.80      8.01      8.01
         2    Allow some                     771     39.84     40.92     48.94
         3    Allow a few                    704     36.38     37.37     86.31
         4    Allow none                     258     13.33     13.69    100.00
              Total                         1884     97.36    100.00
Missing  .a                                   13      0.67
         .b                                   38      1.96
              Total                           51      2.64
Total                                       1935    100.00

From that question, the actual proportion of French respondents who support a harsh
anti-immigration policy is hard to determine:
−

As shown in the frequency table above, respondents who prefer allowing
“some” or “many” immigrants from poorer countries outside Europe form a
minority of 48.94%, as calculated from the cumulative distribution of all non-missing observations. Any politician who plans on campaigning on the issue of
immigration will be interested in the figure, to side with either the minority or
the majority of potential voters.

−

An important issue, however, is that the sample uses survey weights to make its
observations more representative of the actual national population, as explained
in Example 5a. Furthermore, the number of observations in the sample only allows us to estimate values for the rest of the population, therefore involving
confidence intervals.

If we weight the data before producing the frequency table, using the [aw] weighting
qualifier and the dweight variable mentioned in Example 5a, respondents who prefer allowing
“some” or “many” immigrants actually form a majority:
. fre impcntr if cntry=="FR" & age >= 18 & ctzcntr==1 [aw=dweight]
impcntr    Allow many/few immigrants from poorer countries outside Europe

                                           Freq.   Percent     Valid      Cum.
Valid    1    Allow many to come and
              live here                 163.0514      8.43      8.63      8.63
         2    Allow some                795.0004     41.09     42.09     50.72
         3    Allow a few               692.9715     35.81     36.69     87.40
         4    Allow none                237.9166     12.30     12.60    100.00
              Total                      1888.94     97.62    100.00
Missing  .a                             11.82642      0.61
         .b                             34.23363      1.77
              Total                     46.06004      2.38
Total                                       1935    100.00

Survey weights also apply, with the [pw] qualifier, to commands like prop, which computes
the confidence interval of each category based on a normal approximation:
. prop impcntr if cntry=="FR" & age >= 18 & ctzcntr==1 [pw=dweight]
Proportion estimation                         Number of obs    =       1884

      _prop_1: impcntr = Allow many to come and live here
      _prop_2: impcntr = Allow some
      _prop_3: impcntr = Allow a few
      _prop_4: impcntr = Allow none

                 Proportion   Std. Err.     [95% Conf. Interval]
     impcntr
     _prop_1       .086319    .0073835      .0718383    .1007997
     _prop_2      .4208712    .0124809      .3963932    .4453491
     _prop_3      .3668574    .0120768       .343172    .3905427
     _prop_4      .1259525    .0081542      .1099602    .1419447

The confidence intervals above are large because of the limited number of valid observations. We used the conventional 95% confidence level; the intervals
would widen even more at the 99% level.
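To see this, the estimation can be re-run with the level() option, which estimation commands accept; a sketch of the previous command at the 99% level:

```stata
* Same proportion estimation, now with 99% confidence intervals.
prop impcntr if cntry=="FR" & age >= 18 & ctzcntr==1 [pw=dweight], level(99)
```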


9.2. Options

You might have noticed that many graphs produced above use graph options, often to
modify the unit, scale or title of an axis. Full-fledged books have been written on the
topic, and the most common options for this course follow:
−

Scales that use percentages (percent) or frequencies (frequency) are often more
useful than density or fractions in histograms. Additionally, you will sometimes
need further settings, which are documented in the help
pages for each type of graph.

−

You might also need to use the ytitle and xtitle options to give shorter titles to
the axes in plots produced by graph commands, or to remove the titles. The
same applies to the title of your graph, which you can set with the title option,
and you can add a note (e.g. the data source) with the note option.

−

Finally, the xscale and yscale options allow controlling the full range of your axes, along with the xlabel and ylabel options that control the spacing between
labelled ticks. These options also apply to all graph commands and are particularly useful to make the values on your axes correspond to the real set of values
that your data can possibly take.
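Combining several of these options, a sketch of a fully annotated histogram might read as follows (the variable and ranges are borrowed from Example 9a; the titles are illustrative):

```stata
* Histogram with percentage scale, custom titles, a source note and axis labels.
hist bmi, percent title("Distribution of Body Mass Index") ///
	xtitle("BMI") ytitle("Share of respondents (%)") ///
	xlabel(10(10)50) note("Source: NHIS 2009")
```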

Example 9e. Democratic satisfaction    Data: ESS 2008

In this example, the main variable of interest is the assessment of democracy in the respondent’s country (stfdem). The codebook indicates that answers to the question
were coded on an interval scale of 0 (“Extremely dissatisfied”) to 10 (“Extremely satisfied”), which means that we can read the mean of the variable as an average score of
satisfaction with democracy. We applied both design and country weights to compute
a European average score of democratic satisfaction, as explained in Example 5b:
. su stfdem [aw=dweight*pweight]
    Variable         Obs      Weight        Mean   Std. Dev.   Min   Max

      stfdem       48711  52569.9013    4.628696    2.615244     0    10

We could try collapsing individual answers by country in order to observe, for example,
whether respondents in Greece are more supportive of democracy than those in Britain. Both
countries can claim a long historical experience with democratic institutions, but Greece
went through an autocratic period in its recent history, and general levels of economic
wealth are lower in Greece than they are in Britain. Grouping observations at the country level might thus provide a useful heuristic preliminary to our analysis.

Further to grouping by country (cntry), we will separate citizens from non-citizens
(ctzcntr) to observe any gap in appreciation between the two groups. At that stage, the
most straightforward way to plot these insights would be the following graph command, which also accounts for design weights:
. gr dot stfdem [aw=dweight], over(ctzcntr) over(cntry)

Unfortunately, the default settings produce a distressingly confused result:
−

The vertical y-axis is unreadable because it plots the stfdem variable over 26
countries and over two groups of citizens and non-citizens, resulting in 52 graph lines.

−

Since the average support scores for democracy are not properly ordered, we
will not be able to read the average scores from most to least satisfied.

−

Additionally, we will want to add an informative legend to the scale: the default
one, “mean of stfdem”, is not straightforward enough.

The graph above is almost entirely useless. A suitable dot plot will use several additional
options and look like the following command, which runs over more than one line, as
the “///” breaks indicate:
. gr dot stfdem [aw=dweight], over(ctzcntr) asyvars over(cntry, sort(1) des) ///
>     legend(label(1 "Citizens") label(2 "Foreigners")) ///
>     ytitle("Satisfaction with democracy in country") ///
>     scale(.8) name(stfdem, replace)

[Dot plot of satisfaction with democracy in country (0–10), by country in descending order of citizens' satisfaction—DK, CH, NO, CY, FI, SE, NL, ES, DE, BE, IL, SK, CZ, PL, GB, SI, EE, FR, GR, IE, PT, TR, RU, HU, UA, BG—with separate markers for citizens and foreigners]
The graph reveals an interesting gradient of opinions, and sometimes large gaps (at
least at the intuitive, visual level) between the subpopulations of citizens and foreigners.
The list of options used in the graph goes as follows:
−

The vertical axis is called with the over() options, and the asyvars option makes
sure that both citizens and foreigners, the categories of the ctzcntr variable, are plotted on the same lines.

−

The sort(1) des option orders the categorical (country) axis by using the descending order of first value displayed on the graph, namely satisfaction among
citizens.

−

The legend and label options allow to rewrite the legend of the graph to “Citizens” and “Foreigners” instead of using the “Yes/No” labels of the ctzcntr variable.

−

The ytitle option provides a title to the continuous axis; this ids different to the
name option, which stores the graph in Stata memory under stfdem, overwriting (replace) any previous graph with that name.

−

The scale(0.8) option serves to reduce the size of all items in the graph to 80%
of their default size, including the labels on both axes and the dots that mark
the average score for democratic satisfaction.

Other useful graph options are:
−

yreverse reverses the continuous y-axis (which is, in fact, horizontal), in order to
plot variables where the coding and labels are inversed, as in questions where
high approval is coded as “1” and disapproval as “4”.

−

yscale(log) switches the axis to logarithmic scale, in order to obtain a better visual differentiation of high values. It is better, however, to transform the variable
if you plan to use a logarithmic scale to measure it.

−

ylabel(1(1)4) and exclude0 modify the continuous axis by respectively setting
the range to 1–4 with ticks every 1 point, and by excluding 0 from the axis of
dot plots, when 0 is not a relevant variable value.

Tweaking the plot to obtain what can be considered a suitable visual result relies on an
admittedly long list of options. However, quality is always preferable to quantity
when it comes to graphs, and these options are useful in many settings.
You might also object that the final graph is still quite difficult to read, and you would
be right: even though recent versions of Stata come with rather powerful graphing capabilities, its poor default settings are sometimes discouraging, and some researchers
prefer to use other software at that stage.


9.3. Normality

Normality is the assessment of whether a variable follows the normal distribution.
The normal distribution is an abstract concept: it refers to a curve that can be translated
into a probabilistic situation, called a probability density function. This function is used
to produce estimates of values and their confidence intervals. Cut the music, we need
your full attention for a few moments.
Start with a constant, k = 5. Let’s say that k is the amount of money that you are willing to donate right now to a human rights organization. Captured at this unique moment of time and of your own preferences, this number alone does not vary: k = 5. If I
try to guess it, there is a 100% chance that the answer “5” is correct: Pr(k = 5) = 1. All
other predictions of k have a probability of 0: Pr(k ≠ 5) = 0.
Release a few assumptions and let k vary in space. We keep our cross-sectional assumption, and will thus leave aside temporal variation in preferences about human rights organizations. Let us assume for now that we want to predict k for the whole population of, say, Russian citizens: how much is each of them willing to donate to human rights organizations? The variable k can now take any value.
Statistical theory intervenes at that point: the more values k can take, and the more observations of k we have, the better we can predict it. This is applicable to coin flipping
as it is to human rights donations, and derives from the Central Limit Theorem, which
can predict the value of k among all Russian citizens from just a sample of them, even if
we do not know how many Russian citizens actually exist.
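The intuition behind the Central Limit Theorem can be checked with a quick simulation. The sketch below is written in Python rather than Stata, purely for illustration, and assumes a made-up population where donations are uniformly distributed between 0 and 10 euros:

```python
import random

random.seed(42)  # make the simulation reproducible

# Hypothetical population: donations drawn uniformly between 0 and 10 euros,
# so the population mean is 5. The Central Limit Theorem says that the means
# of repeated samples cluster around that value, approximately normally.
sample_means = [
    sum(random.uniform(0, 10) for _ in range(100)) / 100  # one sample of n = 100
    for _ in range(1000)                                  # repeated 1,000 times
]
grand_mean = sum(sample_means) / len(sample_means)
print(round(grand_mean, 1))  # very close to the population mean of 5
```

Even though each individual donation is far from normally distributed, the sample means pack tightly around the population mean, which is what allows us to estimate that parameter from a single sample.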
For our purposes, the population of Russian citizens is the set of individual preferences about human rights donations, P = { k1, k2, …, kN }. Each item kn is a value: the amount of money one Russian citizen would be willing to donate. In this population, there is a true mean value of k. In parallel, if we draw a sample of that population, we can calculate its mean value, and its standard deviation.
When the population is unobservable as a whole, our objective becomes to estimate a
population parameter, the true mean value of k, from sample parameters, by observing
the mean value of k in a smaller population of n respondents to a survey that was designed to reach Russian citizens and measure the extent of their willingness to donate
to human rights organizations.
The amount of craft and technique that goes into survey design is immense, and the amount of bias that can be generated at that stage is too substantial not to mention it. Fortunately, there are thousands of well-run surveys with careful sample design, and the stability of some results is another way to gain confidence in our ability to measure the real world, even in its most intricate aspects.
The Central Limit Theorem tells us that the means of repeated samples follow a normal distribution, which has interesting properties for building probabilities to estimate a population parameter from its sample parameter.
Regarding normality, the summarize command run with the detail option gives you
several indications that serve to understand the distribution of your variable.
As far as normality goes, you should concentrate on two indicators:
−	Skewness is an indication of how close to symmetrical the distribution of your variable is. Skewness should approach 0, since the normal distribution is itself perfectly symmetrical.

−	Kurtosis is an indication of how “thick” the tails of your variable distribution are. Kurtosis should approach 3, since that number approximates the tails of a normal distribution.

Remember that the normal distribution is a theoretical construct: deviations from it are therefore natural. You should, however, assess the extent of the deviation from normality, to know how theoretical assumptions apply to your work.
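For intuition, both indicators can be computed by hand from their moment-based definitions. The Python sketch below is an illustration outside Stata, using made-up values; any sample with a long right tail yields a positive skewness:

```python
import math

def skewness(xs):
    # Third standardized moment: 0 for a perfectly symmetrical distribution.
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum(((x - m) / sd) ** 3 for x in xs) / n

def kurtosis(xs):
    # Fourth standardized moment: 3 for the normal distribution.
    # Stata reports this unadjusted version, not 'excess' kurtosis.
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum(((x - m) / sd) ** 4 for x in xs) / n

right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]  # one high value stretches the tail
symmetric = [1, 2, 3, 4, 5]
print(skewness(right_tailed) > 0)  # positive: long right tail
```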
Example 9f. Normality of the Body Mass Index (data: NHIS 2009)

In the example below, continued from Example 9a, the skewness statistic of the BMI variable deviates from 0. Its positive sign indicates that the right-hand side of the distribution is causing that deviation:
. su bmi, d

                       Body Mass Index
-------------------------------------------------------------
      Percentiles      Smallest
 1%     18.30729       15.20329
 5%     20.11707       15.20329
10%     21.26276       15.20329       Obs            24291
25%     23.51343        15.5041       Sum of Wgt.    24291

50%     26.57845                      Mean           27.27
                        Largest       Std. Dev.      5.134197
75%     30.22843       49.60056
90%     34.32617       50.38167       Variance       26.35998
95%     36.91451       50.48837       Skewness       .7207431
99%     41.59763       50.48837       Kurtosis       3.463278

There are more complex ways to assess normality: some statistical tests apply, such as the Shapiro–Francia test, run with the sfrancia command if your data contains fewer than 5,000 observations of ungrouped – ‘unpaired’ – data. These tests, however, are ultimately less useful than graphical assessment with distributional diagnostic plots: the symmetry plot (symplot), which tests for symmetry, the normal quantile plot (qnorm) and the normal probability plot (pnorm) all work towards that end.
The last two plots are shown below: they show that the BMI variable deviates from the normal distribution both in its central values (as shown in the pnorm plot) and at its tails (as shown in the qnorm plot):
. qnorm bmi

. pnorm bmi


9.4. Transformations

The tools that are used to find possible transformations of a variable are:

−	The gladder command, which plots the distribution of the variable under each transformation of the ‘ladder of powers’, as in the histograms shown below.

−	The ladder command, from which the best transformation can be chosen by selecting the one with the lowest chi-squared statistic in the table.

Some common transformations apply to macro-data: country population and GDP per
capita are better expressed in logarithmic units. Others apply to micro-data: the distribution of age, for instance, is often better captured in squared units. Transformations
are generally theoretically informed and will matter when interpreting your data.
Example 9g. Transforming the Body Mass Index (data: NHIS 2009)

Running the commands above suggests that BMI approaches normality when measured on a logarithmic scale (the log panel in the figure below):

[Figure: histograms of the Body Mass Index under each transformation of the ladder of powers (cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic).]

Performing a logarithmic transformation in Stata requires using the ln() function to calculate the natural logarithm of the original BMI variable:

. gen logbmi = ln(bmi)
. la var logbmi "Body Mass Index (log-units)"

We can now check that the skewness and kurtosis of BMI are closer to 0 and 3 than
they previously were, by displaying the histograms for both ‘raw’ and ‘transformed’
BMI side by side with the graph combine command:
. hist bmi, normal ///
>     title("BMI", margin(medium)) xtitle("") name(bmi, replace)
(bin=43, start=15.203287, width=.82058321)

. hist logbmi, normal ///
>     title("log(BMI)", margin(medium)) xtitle("") name(logbmi, replace)
(bin=43, start=2.7215116, width=.02791236)

. gr combine bmi logbmi, ysize(2)

A visual check is usually enough to observe whether a transformation effectively brings
a variable closer to normality, as it does here:
[Figure: side-by-side histograms of BMI and log(BMI), each overlaid with a normal density curve.]

We can finally check for skewness and kurtosis in both variables, in order to see how
much more symmetrical the transformed variable is (skewness ≈ 0), and how its tails
match the tails of the normal distribution (kurtosis ≈ 3):
. tabstat bmi logbmi, c(s) s(skew kurt)

    variable |  skewness  kurtosis
-------------+--------------------
         bmi |  .7207431  3.463278
      logbmi |  .2346392  2.762445
-----------------------------------

In this example, both aspects of the distribution are now closer to normality: the rest of our analysis might hence use the transformed BMI variable. Note that the transformation only affects the unit of measurement for BMI: it does not modify the actual data beyond that characteristic.
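The effect of the transformation can also be sketched outside Stata. The Python illustration below uses made-up, right-skewed values (not the NHIS data) to show how taking logarithms pulls the right tail in and lowers skewness:

```python
import math

def skewness(xs):
    # Third standardized moment: closer to 0 means more symmetrical.
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum(((x - m) / sd) ** 3 for x in xs) / n

raw = [18, 20, 22, 24, 25, 26, 28, 31, 36, 50]   # made-up 'BMI-like' values
logged = [math.log(x) for x in raw]              # same data in log-units
print(round(skewness(raw), 2), round(skewness(logged), 2))
```

As with gen logbmi = ln(bmi) above, only the unit of measurement changes: the ordering of observations is preserved, but the long right tail is compressed.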


9.5. Outliers

Your data might contain outliers, such as a small number of people who earn salaries very far above the median income, or a small number of states with exceptionally small populations. What to do with outliers is primarily a substantive question that depends on your research design.
Conventionally, mild outliers are observations located more than 1.5 times the interquartile range (IQR) beyond the lower or upper quartile of the variable under examination, and extreme outliers more than 3 times the same measure. Refer to the course material for details on how box plots are constructed.
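The fences can be computed by hand. The Python sketch below uses illustrative values and a simplified median-split rule for the quartiles (statistical packages compute quartiles slightly differently), flagging the one value that lies more than 3 times the IQR above the upper quartile:

```python
# Illustrative, sorted data: one value (9.8) sits far above the rest.
values = [2.1, 2.3, 2.5, 2.6, 2.8, 3.0, 3.2, 3.4, 9.8]

def quartiles(xs):
    # Median-split quartiles: medians of the lower and upper halves.
    n = len(xs)
    lower, upper = xs[: n // 2], xs[(n + 1) // 2 :]
    def med(s):
        mid = len(s) // 2
        return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2
    return med(lower), med(upper)

q1, q3 = quartiles(values)
iqr = q3 - q1  # interquartile range
mild = [x for x in values if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
extreme = [x for x in values if x < q1 - 3 * iqr or x > q3 + 3 * iqr]
print(extreme)  # → [9.8]
```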
The examples below use box plots and the extremes package to detect outliers. More advanced techniques for detecting outliers, either before analysis – using graph matrices – or during regression analysis – using leverage-versus-residual-squared plots – are beyond the scope of this course.
Example 9h. Inspecting outliers (data: NHIS 2009)

When we plotted Body Mass Index in the United States, the presence of a large number of outliers was graphically observable on the right-hand side of the box plot distribution (Example 9a).
We have no substantive reason to exclude outliers from this distribution, so we will just explore them with the extremes command and the iqr(3) and N options to list and count respondents who are extreme outliers in the BMI distribution (keep in mind that BMI values over 40 usually indicate morbid obesity):
. extremes bmi sex age raceb health, iqr(3) N

  +---------------------------------------------------------------------+
  |   obs:    iqr:        bmi      sex   age      raceb     health    N |
  |---------------------------------------------------------------------|
  | 22943.   3.001   50.38167   Female    30      Black       Good    3 |
  | 17683.   3.017   50.48837   Female    28   Hispanic       Good    3 |
  | 22511.   3.017   50.48837   Female    63   Hispanic  Very Good    3 |
  +---------------------------------------------------------------------+

We previously found a transformation of BMI that brought the distribution close to normality, so excluding outliers would make little sense overall. Furthermore, removing mild (n = 421) or extreme (n = 3) outliers from the BMI distribution in a sample of N = 24,291 observations would not make a statistical difference.
Example 9i. Keeping or removing outliers (data: QOG 2011)

Detecting outliers can serve a purely informative purpose, but you might also want to
consider working on a subset of your data to exclude outliers from the analysis. In the
example below, we study private health expenditure as a fraction of gross domestic
product (variable: wdi_prhe). Options passed to the graph hbox command will identify
the outliers:
. gr box wdi_prhe, mark(1, mlabel(cname))


The distribution of private health expenditure shows a small number of outliers. Further
exploration of the histogram shows that the outliers are creating a clear deviation from
normality on the right hand side of the distribution:

[Figure: box plot and histogram of Private Health Expenditure (% of GDP); labelled outliers include Sao Tome and Principe, the United States, Liberia, Georgia, Afghanistan and Nigeria.]

Depending on our research design, we might want to get rid of outlier countries spending more than, say, 5% of their GDP on private health expenditure. This would make
statistical sense, as the distribution of the variable comes closer to normality when we
apply that modification to the data:
. gen wdi_prhe2 = wdi_prhe if wdi_prhe < 5
(18 missing values generated)
. tabstat wdi_prhe wdi_prhe2, c(s) s(n mean skewness kurtosis)

    variable |    N      mean  skewness  kurtosis
-------------+-----------------------------------
    wdi_prhe |  188  2.592303  1.337639  5.993119
   wdi_prhe2 |  176   2.33017  .2326665  2.370385
--------------------------------------------------

The commands above created a wdi_prhe2 variable by copying values of private health expenditure from the wdi_prhe variable when these were below 5. The operation excluded a few countries, and the distribution of the new variable now better satisfies the normality criteria.
Statistically, it would make sense to stick with the wdi_prhe2 variable for the rest of the analysis. However, excluding data points (observations) requires a substantive justification: we would thus have to document the exceptionality of health expenditure in the outlier countries prior to their exclusion.


10. Association
In this section, you will test your variables for independence, that is, you will assess whether you should reject or retain the null hypothesis that states an absence of association between two variables, such as income and education or population density and GDP.
These tests are useful to your analysis because they will suggest whether your independent variables are suited for inclusion in your regression model. The tests will also allow you to identify some of the interactions that might exist between your independent variables.
At that stage, you should make sure that you understand the statistical terms that relate to probability distributions. Remember, for example, that you have to determine the ‘alpha’ level of significance that you will be using before looking at p-values and other test statistics.
You might also want to check that you understand the core logic of association. The key problem with understanding causal inference in observational (i.e. non-experimental) settings is confounding, and crosstabulation is a simple method to minimize confounding. We will later learn about simple and multiple linear regression modelling, other statistical methods developed with a similar purpose.

10.1. Tests
At that stage of your analysis, you will be running independence tests, which by definition compare two groups (and usually two different variables). Choosing which test to apply depends primarily on the type of your variables:
−	When both variables are categorical, the test will operate on a table with a few rows and columns, called a crosstabulation or a contingency table.

−	When the dependent variable is continuous and the independent variable is dichotomous, different tests will operate by comparing differences in means or differences in proportions.

−	These tests do not cover all possible situations and form only a preliminary step to regression, which allows analysing two or more variables of any type. We also leave correlation aside for Section 11.

Bivariate tests introduce some important caveats of statistical tests:
−	Association is not causation: finding that an association exists between two variables does not imply that the variables are causally related. Both variables can be – weakly, moderately or strongly – associated, but moving from association to causation requires a theoretical and substantive understanding of the variables that no statistical test can provide on its own. The same is true of correlation, which we will restate in Section 11.1 before writing our linear regression models.
For example, an association between the level of income and the level of education of individuals does not provide any information on the causal links that relate income to education: we need a theory of how, for instance, the income of a household influences the educational attainment of its children, and how, in turn, the level of education of these children will determine their income when they start to work. The statistically significant association of education and income contains, in itself, none of these explanatory, theoretical elements.
The same would be true of an association between music tastes and political ideology: the direction of the causal link between those characteristics is not situated in the association itself, but in the theoretical understanding of their interplay. Jumping from association to causation with no explanatory theory would be premature. It is also sadly common.
A more complex example is religion and life expectancy. An association between these variables might seem to indicate a ‘direct’ link at the individual level, where religion affects life expectancy positively or negatively; the same association, however, might also indicate a more ‘indirect’ link at the collective level, where religion correlates, for instance, with socioeconomic groups of individuals who enjoy higher or lower life expectancy for reasons (like income) other than religious beliefs. In this case, jumping from association to causation would have erroneously advanced a micro-level interpretation of a macro-level phenomenon. The mistake of inferring individual-level relationships from group-level associations is called the ‘ecological fallacy’ and is also relatively common.

−	Statistical significance is not substantive significance: finding a statistically significant association between two variables does not imply that there exists a substantively significant association between them. Some associations can be statistically significant but devoid of any substantive significance, and conversely.
For example, if a statistical test shows that countries where people drive on the right-hand side have higher fertility rates, it seems safe to state that this statistically significant association corresponds to no substantive phenomenon occurring in the real world. Conversely, the substantively significant association that exists between former colonial occupation by European countries and the side on which people drive might not yield a statistically significant association, for several reasons such as small sample size, coding errors or unobserved changes in traffic policy.
Likewise, statistical significance does not provide an order of magnitude for the substantive significance of the association. The statistical strength of an association relies on data and sample size, and does not indicate that the association is theoretically more important. A study could find a weak association (p < .1) between dictatorship and civil war, and also find a strong association (p < .001) between oil resources and civil war, and still conclude that dictatorship is a more important explanatory factor of civil war than the presence of oil. The significance levels of data analysis are frequently confused with theoretical results, which is why it is crucial to remember that statistical significance only stands for a conventional indication of significance, usually based on the p < .05 level. Any further interpretation of significance will have to be theoretically, not statistically, driven.
Example 10a. Trust in the European Parliament
The following examples measure the average trust in the European Parliament. We use
the measures available for the populations of three peripheral European countries, Portugal, Ireland and Greece, which became infamously known as the ‘PIG’ countries during the current financial crisis (study: ESS, variable: trstep).
We start by subsetting the data to these countries:
. keep if inlist(cntry,"GR","PT","IE")
(44939 observations deleted)

Within these countries, we will look at average trust in the European Parliament of citizens and non-citizens, using the same ctzcntr binary variable for citizenship that we
used in Example 9e. The ctzcntr variable needs to be binary if we are to run a t-test on
its two groups (citizens and foreigners).
The command to run a separate t-test for each country is as follows:
. bysort cntry: ttest trstep, by(ctzcntr)

The results of the first t-test (a test that we explain further in Section 10.4) show that, in Greece, citizens are less inclined to trust the European Parliament than non-citizens. The test compared the average trust scores of both groups, measured on an ordinal scale of 0 (no trust) to 10 (complete trust), and found that the difference is negative: the “Yes” group of citizens shows less trust in the European Parliament (4.3) than the “No” group of foreigners (5.8):
-> cntry = GR

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
     Yes |    1929    4.341109    .0570617     2.50617      4.2292    4.453018
      No |      72    5.888889    .3055022    2.592272    5.279735    6.498043
---------+--------------------------------------------------------------------
combined |    2001    4.396802    .0564503    2.525167    4.286094    4.507509
---------+--------------------------------------------------------------------
    diff |           -1.54778     .3011897               -2.138458    -.957101
------------------------------------------------------------------------------
    diff = mean(Yes) - mean(No)                                   t =  -5.1389
Ho: diff = 0                                     degrees of freedom =     1999

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

The latter group includes only a few observations (n = 72), and the standard error for
their average trust in the European Parliament is therefore higher (.30) than it is for
Greek citizens (.05). The standard deviations further indicate that the groups have a
relatively similar underlying distribution of trust scores. Still, the confidence intervals do
not overlap, and the t-test concludes that the –1.54 points difference in average trust
scores is statistically significant with only a very, very small risk of error, denoted by the
probability level of the alternative to the null hypothesis being close to, but not equal
to, “0.0000” (middle value).
The ‘lateral’ probabilities confirm the direction of the relationship: the difference in
scores between average trust among citizens, mean(Yes), and average trust among foreigners, mean(No), is highly likely to be negative and highly unlikely to be positive, indicating higher trust among the latter group. To observe this difference and reach that
conclusion, two conditions are met: average trust is effectively different in both groups,
and the sample size of each group is large enough to establish that the difference is statistically significant.
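The t statistic itself is easy to reproduce from the group summaries. The Python sketch below (an illustration outside Stata) recomputes the equal-variance two-sample t statistic from the Greek results, and lands on Stata's t = -5.1389 up to rounding:

```python
import math

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    # Equal-variance two-sample t statistic, as in Stata's default ttest:
    # pooled variance weighted by each group's degrees of freedom.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se

# Group summaries taken from the Greek t-test output.
t = two_sample_t(4.341109, 2.50617, 1929, 5.888889, 2.592272, 72)
print(round(t, 4))  # matches Stata's t = -5.1389 up to rounding
```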
Depending on the actual difference and on the number of observations available in each group, the t-test might have a harder time identifying any statistically significant difference, as shown in the results for Ireland:

-> cntry = IE

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
     Yes |    1486    4.635262    .0589952    2.274187     4.51954    4.750985
      No |     184    5.195652    .1675218    2.272377     4.86513    5.526175
---------+--------------------------------------------------------------------
combined |    1670    4.697006    .0557944    2.280073    4.587572     4.80644
---------+--------------------------------------------------------------------
    diff |           -.5603897    .1777167                -.908961   -.2118185
------------------------------------------------------------------------------
    diff = mean(Yes) - mean(No)                                   t =  -3.1533
Ho: diff = 0                                     degrees of freedom =     1668

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0008         Pr(|T| > |t|) = 0.0016          Pr(T > t) = 0.9992

In these results, the difference in trust is still negative, confirming that foreigners are more trusting of the European Parliament than citizens in Ireland too; the gap in average trust is smaller than it is in Greece, but the confidence intervals still do not overlap, and the higher number of observations for foreigners allows the test to work with a lower standard error for that group. The other results of the t-test are quite similar, including the still very, very low probability levels.
Portugal offers a different picture. The gap in average trust is smaller, and the number of foreigners in the sample data is very low. As a consequence of both factors, the confidence intervals overlap and no statistically significant difference comes out of the test:
-> cntry = PT

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
     Yes |    1897    4.355825    .0554851    2.416629    4.247007    4.464643
      No |      47    4.659574    .3777784    2.589919    3.899146    5.420003
---------+--------------------------------------------------------------------
combined |    1944    4.363169    .0549027    2.420704    4.255494    4.470843
---------+--------------------------------------------------------------------
    diff |           -.3037495    .3574689               -1.004813    .3973136
------------------------------------------------------------------------------
    diff = mean(Yes) - mean(No)                                   t =  -0.8497
Ho: diff = 0                                     degrees of freedom =     1942

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.1978         Pr(|T| > |t|) = 0.3956          Pr(T > t) = 0.8022

Example 10b. Female leaders and political regimes
Consider the following example, where we test the average level of democracy (on a
0–10 scale) between two groups of countries, governed respectively by male and female leaders (study: QOG; variables: p_democ and m_femlead).
The command and its results follow:

. ttest p_democ, by(m_femlead)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 0. Male |     149    5.228188    .3218452    3.928621    4.592182    5.864193
1. Femal |       8       8.125    .5153882    1.457738    6.906301    9.343699
---------+--------------------------------------------------------------------
combined |     157    5.375796    .3106018    3.891829    4.762268    5.989324
---------+--------------------------------------------------------------------
    diff |           -2.896812     1.39774               -5.657889   -.1357348
------------------------------------------------------------------------------
    diff = mean(0. Male) - mean(1. Femal)                         t =  -2.0725
Ho: diff = 0                                     degrees of freedom =      155

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0199         Pr(|T| > |t|) = 0.0399          Pr(T > t) = 0.9801

The results of the t-test seem to indicate that countries run by female leaders are significantly more democratic at the 95% confidence level (p < .05). What shall we conclude from this test? In a nutshell, nothing:
−	The dataset includes only 8 countries with female leaders, which leads to high standard errors and wide confidence intervals, and makes the statistical significance of the test misleading. Maximising sample size would be a prerequisite to any further tests using this variable.

−	Furthermore, there is no theoretical support for drawing any form of conclusion from these tests: for instance, female leaders like German Chancellor Angela Merkel or Brazilian President Dilma Rousseff have not made their countries dramatically more or less democratic, and while democratic elections might have made their rise to power possible, so many factors other than regime type are at play in selecting female heads of state that the test in itself is devoid of analytical substance.

−	Likewise, the statistical significance of the test cannot be taken as proof that autocracies are equally likely to be ruled by male or female leaders: even a cursory account of 20th century history will show that autocracies are very unlikely to be led by female rulers, who are almost systematically excluded from the social groups that provide autocrats, such as the higher levels of military hierarchies. The absence of female autocrats in recent history (one has to go back to 18th century Russia to find a genuine female autocrat) makes the test even more absurd.

−	Finally, the democracy and autocracy indexes, despite the fact that they originally come from the canonical Polity IV dataset, are open to criticism (more precisely, the intermediate levels of democracy and autocracy that they provide are highly problematic). Additional flaws in the data, such as measurement error, are thus likely to affect independence tests, which makes the jump from association to causation a rather arbitrary one.
The few remarks above make an important point: if you cannot substantively justify your test to match its statistical significance results, you will ultimately fall short of saying anything relevant about the independence of its groups. This situation, and many others, falls under what is sometimes jokingly referred to as a “Type III” error, which consists in giving the right (statistical) answer to the wrong (substantive) question. The conventional “Type I” and “Type II” errors are addressed below.

10.2. Independence
Statistical tests provide a form of proof by contradiction. Their logic consists in assuming that the two variables are in fact unrelated, free of any association, and then trying to disprove that assumption. This assumption is referred to as the null hypothesis, noted “H0”. It would be absurd to think, for example, that voters or political leaders who drink herbal tea are more likely to support racist ideologies: herbal tea consumption and racism are (hopefully) independent from each other.
Consider a less absurd example: does religion have an effect on political support for democracy? Hypothetically speaking, holding religious beliefs might affect political views by increasing individual self-confidence and providing beliefs that either support or reject democratic rule. The hypothesis might run in either direction: at that stage, there is no support for a particular one. Furthermore, holding religious beliefs is not just an individual factor: when large groups of individuals hold religious beliefs, other factors will come into play, such as group persecution or collective dominance, which might also affect how each individual then views democracy. When working with so many known unknowns, the only safe line of reasoning actually consists in suspending all our former beliefs and… staying agnostic. The null hypothesis thus states that religion and democratic support have no relationship whatsoever. More substantively, what the null hypothesis means is that virtually 100% of democratic support can be explained by factors other than religion, such as socioeconomic factors and other contextual elements like the ones cited above.
Example 10c. Religiosity and military spending
Is the average level of religiosity in the population associated with the percentage of gross domestic product spent on military expenditure? Before jumping to any conclusion, consider the following:
−	You might have good reasons to think that there is a positive association, if the question reminds you of how particularly bellicose countries invoke religious motives to justify, for example, some forms of ‘holy war’ – but you already realise, at that point, that ‘holy wars’ can explain only a tiny fraction of military spending by states worldwide.

−	You might also have good reasons to think that most practices of religion actually preach non-aggressiveness, and would therefore drive states to bring military expenditure down in the absence of popular support for it. At that stage, you also realise that military expenditure is not necessarily subject to public opinion pressures, and that even if it is, other factors are likely to come into play, with possibly greater explanatory power.

−	Finally, you will probably conclude that military expenditure and religion should not be assumed to be associated by default: the most reasonable approach, and the statistically correct one, is to adopt an agnostic stance by stating the null hypothesis, which basically means that “we cannot know from the start, there might just be an association, but only by rejecting the absence of any relationship can we establish that.”

−	Just for kicks, you can open the QOG data and try to find a statistically significant association between the wvs_rel variable (an average measure of “how important God is” to the population) and the wdi_megdp variable (national military expenditure as a % of GDP). It will quickly appear that any categorisation of either variable is going to distort the data to the point where finding an association will primarily rely on your own manipulations.

You should mentally start any bivariate test with a null hypothesis: even when you are testing a plausible association, such as between education and racism, you should consider the null hypothesis: these variables are independent from each other, no association exists between education and racism, virtually 100% of racism can be explained by factors other than education (and vice versa).
Your test then proceeds by trying to reject the null hypothesis. More precisely, it will provide the probability of observing data at least as extreme as yours if the null hypothesis were true. That probability is expressed as the p-value, which varies between 0 and 1. In order to reject the null hypothesis, you will read the p-value: a p-value close to 0 indicates that your data would be very unlikely under the null hypothesis, whereas a p-value close to 1 indicates that your data are highly compatible with it.
Consequently, a bivariate test that reveals a statistically significant association will come
with a low p-value. The level below which you can reject the null hypothesis is called
the (alpha) level of significance. By convention, α = 0.05 in most circumstances, for no
other reason than the practical convenience of that decision rule. If p < 0.05, assuming
an association between your two variables comes with less than a 5% risk of assuming
an association where there is actually none—a situation called a “Type I” error, where
you reject the null hypothesis even though it is actually true. The reverse situation,
where you retain the null hypothesis while it is actually false, is called a “Type II” error,
and is frequent in small samples on which statistical tests produce less reliable results.
Note that this explanation conflates significance testing with hypothesis testing, which is
theoretically inaccurate, but acceptable for our purpose here. If you find a statistics
textbook that correctly reports the difference between Fisher’s p and Neyman–
Pearson’s α, then you are reading quite an advanced textbook. The conflation made
here is technically mistaken in statistical reasoning, but it should not prove as problematic
as other confusions addressed elsewhere in this guide.
Based on what has just been outlined, you should try to minimize “Type I” and “Type
II” errors in your tests. If you need to establish higher certainty about an association, as
is often the case in studies involving chemicals, where a “Type I” error might carry
dramatic consequences for the people exposed to them, you will use
α = 0.01 and reject the null hypothesis only if p < 0.01, or even a lower threshold such
as p < 0.001 or p < 0.0001. In parallel, if you fear missing an association by retaining
the null hypothesis when you should have rejected it, thereby making a “Type II” error,
then you should maximize sample size and reduce the number of missing observations,
in order to maximize the number of observations for each variable of interest.
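The logic of the α = 0.05 decision rule can be simulated. The sketch below is not part of the guide’s Stata workflow: it is a pure-Python illustration that draws two groups from the same distribution (so the null hypothesis is true by construction), runs a two-sided z-test on each pair, and counts how often the test wrongly rejects the null, which happens about 5% of the time.

```python
# Illustrative sketch (not from the guide): the Type I error rate of a
# two-sided z-test when the null hypothesis is true by construction.
import math
import random

def z_test_p(sample_a, sample_b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    # Standard normal CDF from math.erf; p is the two-tailed area
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
trials = 2000
rejections = 0
for _ in range(trials):
    # Both groups come from the SAME distribution: any rejection is an error
    a = [random.gauss(0, 1) for _ in range(100)]
    b = [random.gauss(0, 1) for _ in range(100)]
    if z_test_p(a, b) < 0.05:
        rejections += 1  # a Type I error

print(rejections / trials)  # hovers around 0.05, i.e. the chosen alpha
```

The rejection rate matches α because that is exactly what the significance level promises: the long-run share of true null hypotheses that the test will wrongly reject.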
Example 10d. Religion and interest in politics
The Chi-squared tests below test for a relationship between having an interest in politics and belonging to a religion (data: ESS, variables: polintr and rlgblg). The tests are
respectively applied to French and Russian respondents:
. keep if inlist(cntry,"FR","RU")
(46557 observations deleted)
. bysort cntry: tab polintr rlgblg, chi2

As running the tests and reading the results will show, the association between the two
variables is statistically significant in France (p < .05), but not in Russia (p > .05). We should
hence reject the null hypothesis for French respondents, and consider that a relationship
exists in this country between the two factors. In the case of Russia, however, we cannot
reject the null hypothesis and must retain it. There is a small probability that we are
wrong in both cases: in the French case, rejecting the null hypothesis while it is actually
true would lead to a “Type I” error, and in the Russian case, retaining the null hypothesis
while it should have been rejected would lead to a “Type II” error.
Statistical tests such as the Chi-squared test operationalize the probabilistic logic described above. Take, for example, the results of the above Chi-squared test for French
respondents, showing column percentages instead of frequencies:
. tab polintr rlgblg if cntry=="FR", col nofreq chi2

                        |  Belonging to particular
        How interested  |  religion or denomination
            in politics |       Yes         No |     Total
------------------------+----------------------+----------
        Very interested |     13.57      17.62 |     15.66
       Quite interested |     38.72      33.55 |     36.06
      Hardly interested |     35.33      33.65 |     34.46
  Not at all interested |     12.38      15.17 |     13.81
------------------------+----------------------+----------
                  Total |    100.00     100.00 |    100.00

            Pearson chi2(3) =  12.5681   Pr = 0.006

The null hypothesis states that interest in politics is independent from religion. In that
case, there should be proportionally as many “very interested” respondents among French
religious believers and non-believers, but this is not the case: very high interest in politics
(15.66% of all observations) is over-represented among non-believers and under-
represented among believers.
Using the frequencies of each variable, the hypothetical frequencies of their crosstabulation in the absence of any relationship between them can be calculated; these expected
values are then compared with the observed frequencies to calculate the Chi-squared
statistic. This statistic is then combined with the degrees of freedom of the crosstabulation, which correspond to its number of rows minus one, multiplied by its number of
columns minus one.
Your handbook details both computations, and will also give you a table that crosses
the Chi-squared statistic with its degrees of freedom to obtain the p-value of the association between both variables. The underlying logic of the Chi-squared test (closely related
to the ‘goodness of fit’ test) is essential to your understanding of how hypothesis testing
works: make sure that you are familiar with it before moving on to further techniques.
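The expected-frequency arithmetic can be sketched outside Stata. Since the table above reports column percentages rather than raw counts, the pure-Python illustration below uses made-up cell counts (hypothetical data, not the ESS figures) purely to show how expected values, the Chi-squared statistic and the degrees of freedom are obtained.

```python
# Illustrative sketch with hypothetical counts: expected frequencies and
# the Chi-squared statistic for an r x c crosstabulation.
def chi_squared(table):
    """table: list of rows of observed cell counts."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # under independence
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

# Hypothetical counts: 4 interest levels (rows) x religion yes/no (columns)
observed = [[120, 160], [340, 305], [310, 306], [109, 138]]
chi2, df = chi_squared(observed)
# With df = (4-1)*(2-1) = 3, the critical value at alpha = 0.05 is about 7.81
print(round(chi2, 2), df)
```

Note the property the formula relies on: the expected cell counts are built from the table’s own margins, so the row and column totals are preserved under independence.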
This short example yields a Chi-squared of 12.5 against 3 degrees of freedom, with a
corresponding p-value reported as “0.006” (rounded to three decimal places). The small
p-value allows us to reject the null hypothesis with great confidence, as there is only a
roughly 0.6% risk that we are making a “Type I” error. We can therefore reject the null
hypothesis and support our alternative hypothesis, which states that interest in politics
and religion are not independent but significantly associated in the French sample of
respondents.
At that stage, however, we still lack a plausible theory to explain that association substantively. Even if we find an explanation, reading the whole crosstabulation will show
that the relationship seems more complicated than expected: religion does not increase
or decrease interest in politics uniformly across all groups. We will also want to make
sure that the effect of religion on interest in politics is not cancelled by, for example, the
average age of respondents in each group. The Chi-squared test leaves these questions
unanswered: more sophisticated tests will (start to) address them in Section 11.

10.3. Crosstabulations
Most bivariate tests combine two categorical variables, such as income groups, levels
of education, geographical regions or regime types. Crosstabulations of such variables
are especially frequent with survey data, where the answers given by the respondents
are coded as nominal variables, such as religion, or as ordinal variables on ‘short’ scales,
such as agreement scales that usually range from 3 to 12 points, from “Strongly agree”
to “Strongly disagree”.
These tests combine variables in a “r x c” (rows by columns) contingency table. The
intersection of each row with each column forms a cell that contains the number of observations (called a cell count) for that intersection. Tables can be made easier to read

with row and/or column percentages, but the type of test to use ultimately relies on cell
counts, as shown in the examples below.
Example 10e. Legal systems and judicial independence
An ever larger number of countries runs elections, but fraud or candidate intimidation
was reported in at least a fifth of the cases recorded between 2001 and 2008
(study: QOG, variable: dpi_fraud). Among the countries affected by electoral fraud,
some are also former colonies of Western imperial powers:
. tab fcol dpi_fraud, exact
           |  Fraud or Candidate
    Former |     Intimidation
    colony |        0          1 |     Total
-----------+---------------------+----------
         0 |       76         23 |        99
         1 |       53         11 |        64
-----------+---------------------+----------
     Total |      129         34 |       163

           Fisher's exact =                 0.431
   1-sided Fisher's exact =                 0.234

The results above show Fisher’s exact test, which is superior to the Chi-squared test on
‘2 x 2’ contingency tables like the one above. The test produces a single statistic that
reads directly as a p-value for the association between the variables. Here, the risk of a
coincidental association between former colony status and electoral fraud is far from
meeting any reasonable level of significance. We can retain the null hypothesis and look
for other explanatory factors of electoral fraud.
The Chi-squared test is also inferior to Fisher’s exact test when some cells in the crosstabulation hold fewer than five observations. When that ‘5+’ convention is violated,
Fisher’s exact test is recommended over the Chi-squared test.
There are many more tests available for association in categorical data. Cramér’s
V, for instance, is a measure that complements the Chi-squared test by quantifying the
strength of the association. Other nonparametric tests were also designed for particular
associations:
−

When both variables are ordinal, Spearman’s rho is a rank correlation coefficient that better captures their association than the tests mentioned above. An equivalent test, Kendall’s tau, uses a different computational logic (closer to the Gamma test) to achieve similar results.

−

When both variables are interval/ratio (more simply: continuous), Pearson’s r provides a correlation coefficient that we will explore in Section 11, where we cover correlation and regression as stronger analytical tools that go beyond testing for independence.


10.4. Comparisons of means
A common approach to some quantitative indicators will involve measuring a continuous variable in two discrete groups, such as males and females. When a difference appears between these groups, it is often measured as a difference in means. This setting
is common in experiments, such as when we measure the average literacy of a group of
children who were given free schoolbooks against the average literacy of another group
of children whose parents had to pay for the same schoolbooks.
The comparison of means works by running a t-test, which computes the mean of a
continuous variable over two groups provided by a categorical or binary (dummy) variable. The test compares the means by estimating whether their difference is statistically
significant. This method also appears in other tests, especially when the two groups can
be paired, as in the case of control and treatment groups. When conditions are met for
running an analysis of variance (ANOVA), as is common in psychological and clinical
studies, comparing means of a continuous variable (such as blood pressure) across two
groups of patients (e.g. those who received some medication and others who received
a placebo) is a standard technique to establish the causal effect of a given treatment.
Note: if your continuous variable is in fact coding for a binary outcome, such as the result of a medical procedure that succeeds (1) or fails (0) to cure a patient, then the distributional assumption of normality on which the t-test relies will be violated. In that
case, you should use the prtest command, which takes exactly the same syntax as ttest, as
in prtest cure, by(gender). In the social sciences, this is relevant to dummy variables
coding for dichotomous outcomes such as the fact of being divorced (1) or not (0).
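The statistic behind a two-sample test of proportions can be sketched in a few lines. The pure-Python illustration below (with hypothetical cure rates, not data from the guide) pools the two proportions under the null hypothesis and converts the resulting z statistic into a two-sided p-value:

```python
# Sketch of a two-sample test of proportions with hypothetical data:
# the z statistic uses the pooled proportion under the null hypothesis.
import math

def prop_z(success1, n1, success2, n2):
    p1, p2 = success1 / n1, success2 / n2
    pooled = (success1 + success2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF (math.erf)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical cure rates: 45/100 in one group, 30/100 in the other
z, p = prop_z(45, 100, 30, 100)
print(round(z, 2), round(p, 3))
```

With these made-up figures the difference of 15 percentage points is significant at α = 0.05, which is the kind of conclusion the proportion test is designed to deliver for binary outcomes.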

Example 10g. Gender and left-right political positioning
Political parties rely on different electoral clienteles and sometimes assume that positioning on the left–right spectrum significantly differs for men and women. A simple
test thus consists in measuring the mean left–right positioning of men and women for
several countries. We will start by looking at aggregate scores of left–right positioning
at the country level (data: ESS, variables: lrscale and gndr):
. gr dot lrscale, over(gndr) asyvars over(cntry, sort(1) des) ///
>     exclude0 ylab(1 "Left" 10 "Right") ytit("") scale(.85)

The graph, which was slightly scaled down with the scale option for cosmetic reasons,
ranks countries by the left–right score of males. It shows no consistent pattern for the
average scores of males and females at the macro level:


[Dot plot: average left–right placement by gender for 26 countries (IL at the top, ES at the bottom), sorted by the male average; scale from “Left” to “Right”; legend: Male, Female.]

The t-test then indicates an interesting result, which deserves some attention. In the
results below, the t-test is indeed statistically significant, but substantively insignificant.
We will read through each part of the test to reach that conclusion:
. ttest lrscale, by(gndr)
Two-sample t test with equal variances

   Group |    Obs       Mean   Std. Err.  Std. Dev.  [95% Conf. Interval]
---------+---------------------------------------------------------------
    Male |  20620   5.211397   .0158609   2.277571   5.180308   5.242485
  Female |  22682   5.119566   .0146951   2.213164   5.090763    5.14837
---------+---------------------------------------------------------------
combined |  43302   5.163295   .0107862   2.244507   5.142154   5.184436
---------+---------------------------------------------------------------
    diff |          .0918305   .0215926              .0495087   .1341524

    diff = mean(Male) - mean(Female)                      t =   4.2529
    Ho: diff = 0                           degrees of freedom =    43300

    Ha: diff < 0            Ha: diff != 0             Ha: diff > 0
 Pr(T < t) = 1.0000    Pr(|T| > |t|) = 0.0000     Pr(T > t) = 0.0000

The interpretation of the t-test goes as follows:

− WRITE.

− From the theoretical part of the course, you should remember at that point that the p-value for the alternative hypothesis is derived from the t-statistic, and that the sum of the p-values for both directional hypotheses diff < 0 and diff > 0 will always be 1.

Example 10h. Obesity and racial-ethnic profiles
We continue to explore the Body Mass Index of U.S. respondents (study: NHIS, variables: sex, raceb and bmi, as previously calculated in Example 9a) by looking at the
breakdown of BMI by gender and four main racial-ethnic profiles. We start by
plotting the average BMI for the four groups of the raceb variable. Graphically, we want
three separate dot plots showing the average BMI for each racial-ethnic profile: among
males, among females, and for both sexes together:
. graph dot bmi, over(raceb) ///
>     by(sex, rows(3) total note("")) ytit("Average Body Mass Index")

In this presentation, the three dot plots are stacked into a single column of three rows
by using the rows option. The plot for both gender groups is further generated by the
total option. Through visual inspection of the variables of interest, we can establish
whether gender and race might account for some variation in the Body Mass Index of
U.S. respondents:
[Dot plots: average Body Mass Index (scale 0–30) for White, Black, Hispanic and Asian respondents, in three panels: Male, Female, Total.]

The comparisons suggested by this graph cannot all be brought together into a single
bivariate test; instead, multiple linear regression will be used to describe the joint effects
of gender and race on BMI (Section 11). A single comparison can focus, however, on
the markedly lower BMI of Asian respondents relative to all other racial-ethnic profiles.
The t-test will not accept more than two values for the independent categorical variable, so we had to create a dichotomous (or binary) variable, coding “1” for “Asian”
and “0” for any other racial-ethnic profile. As we do so, we must be careful to prevent
missing data from being coded as “0”, as it would distort the data if some respondents
did not report their racial-ethnic profile.
We do not need to use recode to generate a full-fledged variable with proper labels at
that stage: a dummy that we will create on the fly is enough. A single line of code generates that dummy through the logical statement (raceb==4), which returns 0 when
false and 1 when true. Where the raceb variable indicates the value for “Asian”
(raceb==4), it will code 1 and 0 otherwise.
Using this statement, we create the dummy variable asian for all non-missing observations of raceb, using the additional logical statement if !mi(raceb) to that effect, and
then finally check our operation with the su command, for which the mean indicates
the percentage of Asians in the sample:
. gen asian=(raceb==4) if !mi(raceb)
. su asian
    Variable |    Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
       asian |  24291    .0564407    .2307754          0          1
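The dummy-coding logic of the gen command can be sketched outside Stata. In the pure-Python illustration below (with toy data, not the NHIS sample), None plays the role of Stata’s missing values and must stay missing in the dummy rather than being coded 0; as in the su output, the mean of the dummy equals the share of Asians among non-missing observations.

```python
# Sketch of the dummy-coding logic: (raceb == 4) if !mi(raceb).
# None stands in for Stata's missing values and must remain missing
# in the dummy, rather than being silently coded 0.
raceb = [1, 4, 2, None, 4, 3, None, 1]   # toy data, 4 = "Asian"
asian = [None if r is None else int(r == 4) for r in raceb]

non_missing = [a for a in asian if a is not None]
print(asian)                                # missing stays missing
print(sum(non_missing) / len(non_missing))  # mean = share of Asians
```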

We then run a t-test for BMI between Asians and non-Asians:
. ttest bmi, by(asian)
Two-sample t test with equal variances

   Group |    Obs       Mean   Std. Err.  Std. Dev.  [95% Conf. Interval]
---------+---------------------------------------------------------------
       0 |  22920   27.44618   .0340063    5.14833   27.37953   27.51284
       1 |   1371   24.32456   .1037125   3.840163   24.12111   24.52801
---------+---------------------------------------------------------------
combined |  24291      27.27    .032942   5.134197   27.20543   27.33457
---------+---------------------------------------------------------------
    diff |          3.121622   .1413385               2.84459   3.398654

    diff = mean(0) - mean(1)                              t =  22.0861
    Ho: diff = 0                           degrees of freedom =    24289

    Ha: diff < 0            Ha: diff != 0             Ha: diff > 0
 Pr(T < t) = 1.0000    Pr(|T| > |t|) = 0.0000     Pr(T > t) = 0.0000

The interpretation of the t-test goes as follows:
−

Out of N = 24,291 respondents for which we could calculate a Body Mass Index, the average BMI approaches 27.4 for the n = 22,920 non-Asian respondents and 24.3 for the n = 1,371 Asian respondents. The standard error is larger for Asians, since we have fewer observations for that group.

−

The difference in means shows that the BMI of Asians is lower by roughly 3 points than the BMI of non-Asians in the United States. Furthermore, the standard deviation of BMI within the Asian group is smaller than for the rest of the sample, indicating that the distribution of BMI among Asian respondents is more compact around its mean.

−

Pause at that stage and interpret the difference substantively. The Body Mass Index follows an international standard, where a BMI of 25 or more indicates overweight. Therefore, the average respondent in the sample is overweight by our results. However, this does not apply to Asian respondents, who are slightly below, but not that much below, that conventional threshold.
−

The null hypothesis (Ho) predicts that the difference in means between Asian and non-Asian respondents is null (Ho: diff = 0). If the observed difference is non-null, the test estimates the probability for that difference to be due to sampling error (although it naturally cannot correct for measurement error).
The null hypothesis hence tests the following statement: “If the Body Mass Index is strictly independent from race, then any difference in means between Asians and non-Asians is accidental.” Rejecting the null hypothesis amounts to rejecting that statistical statement, which concerns statistical significance; additional observations about the causes and reasons of that difference will require a substantive theory, such as a difference in physiological and nutritional determinants among Asian respondents.
The t-test actually shows a difference, noted diff = mean(0) - mean(1), that appears to be statistically robust, therefore contradicting the null hypothesis. In this example, on average, the Body Mass Index of Asians (the group for which the by variable, asian, takes the value 1) is lower than the BMI of non-Asians (group 0) by approximately 3 points (diff = 3.12, 95% CI = 2.84–3.39).

−

The alternative hypothesis (Ha) predicts that there is a meaningful association between race and Body Mass Index, which should cause the average BMI of Asians to differ substantively from the average BMI of non-Asians. This hypothesis implies that the difference in average BMI between both racial-ethnic profiles should be significantly different from zero (Ha: diff != 0).
The p-values for the t-test (Pr) indicate that we can reject the null hypothesis Ho, because the p-value for Ha: diff != 0 is below our level of significance, α = 0.05. At that stage, we gain empirical confirmation of what we previously observed graphically. More precisely, the test shows that subtracting the mean BMI of Asians from the mean BMI of non-Asians is very likely to give a positive result: the p-value for Ha: diff > 0 is highly significant (p < .01).
Interpreting a t-test requires reading all the information used in this example,
but reporting a t-test is usually much quicker. Comparing means between
groups that fit with reasonable theoretical expectations generally just requires
reporting the existence of a significant difference. Other results are less important, given that we will obtain more precise estimates of the difference by
including racial-ethnic profiles with other variables into our regression model.
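The arithmetic behind the t statistic above can be replicated from the summary statistics that Stata reports. The following pure-Python sketch (not part of the guide’s Stata workflow) applies the pooled-variance formula of the equal-variance two-sample t-test to the group sizes, means and standard deviations from the output:

```python
# Reproducing the t statistic from the reported summary statistics
# (equal-variance two-sample t-test).
import math

def t_from_summary(n1, m1, s1, n2, m2, s2):
    """Pooled-variance t statistic and degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df

# Non-Asian vs Asian BMI, figures taken from the ttest output above
t, df = t_from_summary(22920, 27.44618, 5.14833, 1371, 24.32456, 3.840163)
print(round(t, 2), df)  # 22.09 and 24289, matching Stata's output
```

The formula makes the role of sample size explicit: the standard error shrinks as the groups grow, which is why the same 3-point difference would not be significant in a much smaller sample.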
Example 10i. Political regime and female legislators
The t-test can quickly run into issues of statistical significance if it is run on a low number of observations. The following example tests the hypothesis according to which
federal regimes lead to a higher representation of women in parliaments. The hypothesis
could be tested only on a small group of countries for which the data were available (data:
QOG, variables: m_wominpar and pt_federal):
. ttest m_wominpar, by(pt_federal)
Two-sample t test with equal variances

     Group |   Obs       Mean   Std. Err.  Std. Dev.  [95% Conf. Interval]
-----------+---------------------------------------------------------------
  0. No fe |    68   16.06029   1.284192   10.58972   13.49704   18.62355
  1. Feder |    13   18.42308   2.879484   10.38213   12.14922   24.69693
-----------+---------------------------------------------------------------
  combined |    81   16.43951   1.169831   10.52848   14.11147   18.76754
-----------+---------------------------------------------------------------
      diff |        -2.362783   3.196071             -8.724404   3.998838

    diff = mean(0. No fe) - mean(1. Feder)                t =  -0.7393
    Ho: diff = 0                           degrees of freedom =       79

    Ha: diff < 0            Ha: diff != 0             Ha: diff > 0
 Pr(T < t) = 0.2310    Pr(|T| > |t|) = 0.4619     Pr(T > t) = 0.7690

A first issue here has to do with the large standard errors caused by the low number of
observations, but we have no other choice of data. A second issue has to do with
the large standard deviations, which indicate that the mean does not capture the
distribution of women in parliament well. Finally, a last issue could be that both variables
were measured at different points in time, but we expect regime type to be stable.
The second issue is serious, as shown by two kernel density plots of the distribution
of the m_wominpar variable for each regime type. The code for the graph is pretty esoteric, as it involves combining two plots with the tw (“twoway”) command,
using the || operator to separate them:
. tw (kdensity m_wominpar if pt_federal) || ///
>     (kdensity m_wominpar if !pt_federal), ytit("Density") ///
>     xtit("Women in parliament (%)") ///
>     legend(lab(1 "Federal regimes") lab(2 "Non-federal regimes"))


Against our initial insight, the resulting graph indicates that non-federal regimes actually
reach into higher values of women in parliament, whereas the rest of the distribution is
pretty similar for each regime type:

[Kernel density curves of women in parliament (%, x-axis 0–40) for federal and non-federal regimes.]

The interpretation of the t-test goes as follows:
−

In the N = 81 countries for which we have data for the year 2008, women occupy an average of 18% of parliamentary seats in federal regimes, and an average of 16% in non-federal regimes. The difference in means hence shows that the percentage of women in parliaments is lower by roughly 2 percentage points in non-federal regimes.

−

The null hypothesis (Ho) predicts that the observed difference is caused by sampling error, and is therefore accidental (or coincidental) rather than significant (Ho: diff = 0). The alternative hypothesis (Ha) states that the difference reflects a significant association, and that the difference in the average share of women in parliament is not null (Ha: diff != 0) but rather corresponds to a substantive difference between both regime types.

−

The last lines of p-values (Pr) indicate that we cannot reject the null hypothesis, as the p-value for Ha: diff != 0 is far above any reasonable level of significance. We should conclude that, for the countries under examination, federal rule has no significant incidence on the representation of women in parliaments.


10.5. Controls
An interesting use of bivariate tests resides in identifying control variables: independent
variables that might affect other independent variables, thereby complicating the
interpretation of the relationships that you might come to observe between your variables.
Example 10f. Party support (with controls)
The test featured in Example 10c has confirmed an association between party support
for the British National Party (BNP) and gender, with men being more likely to support
the BNP than women (study: BNP, variable: ). Other factors come into play: for instance, a t-test would reveal that members of trade unions are less likely to support the
BNP than non-members.
If, in turn, women are more likely to be members of trade unions, it is difficult to know
if BNP support is influenced by gender or by trade union membership. Similarly, men
who are members of trade unions might be even less supportive of the BNP than females who are also trade union members, which would make the relationship more
complex.
Using the graph hbar command with two over options, we can plot BNP support over
both gender and trade union membership. In the resulting graph below, it becomes apparent that gender still influences BNP support, even when controlling for trade union
membership:
[Horizontal bar chart: mean feelings towards the BNP (scale 0–2.5) by gender and trade union membership. Non-members: male 1.98, female 1.71; union members: male 1.58, female 1.27.]
The code for the graph shows that we modified several things (we added bar labels,
and we also modified the title and scale of the axis):
. graph hbar bnp, over(sex) over(union2) ylabel(0(.5)2.5) ///
>     ytitle("Feelings towards the BNP") blabel(bar)

We could also ‘control’ for the effect of gender on party support by verifying that, if
men tend to be more supportive of the BNP, an extreme-right party, they also tend to
be less supportive of political parties at the opposite end of the left–right spectrum.
The graph below shows support for the BNP and for several other parties, still broken
down by gender. As it appears, men are indeed less supportive of parties situated
on the left; the difference in support becomes negligible on the right, and becomes
visible again on the extreme right.
[Horizontal bar chart: mean support (scale 0–6) for the BNP, the Conservatives, the Green Party and Labour, by gender. Female: BNP 1.61, Conservatives 4.93, Green Party 4.58, Labour 4.69. Male: BNP 1.90, Conservatives 4.95, Green Party 4.38, Labour 4.54.]

Once again, the code for the graph comes with different modifications of the axis, bar
labels, and legend:
. graph hbar bnp con green lab, over(sex) blabel(bar) ylabel(0(1)6) ///
>     legend(label(1 "BNP") label(2 "Conservatives") ///
>     label(3 "Green Party") label(4 "Labour"))

As the language above suggests, the differences and interpretations that we have given
are entirely tentative, since they are based on graphical comparison. By running the
95% confidence intervals for the average support of men and women to each party
(displayed below), we can see that the differences are not actually robust to the standard error of the mean: the confidence intervals for male and female BNP supports overlap, which means that the difference in support between males and females might be
attributable to sampling error.
. bysort sex: ci bnp con green lab

-> sex = male

    Variable |   Obs       Mean   Std. Err.   [95% Conf. Interval]
-------------+------------------------------------------------------
         bnp |   842   1.895487    .088887     1.72102    2.069953
         con |   852   4.948357   .0850586    4.781408    5.115306
       green |   792   4.376263   .0830203    4.213297    4.539229
         lab |   860    4.54186   .0893638    4.366464    4.717257

-> sex = female

    Variable |   Obs       Mean   Std. Err.   [95% Conf. Interval]
-------------+------------------------------------------------------
         bnp |   899   1.610679   .0721949    1.468988    1.752369
         con |   992   4.934476   .0796756    4.778124    5.090828
       green |   868   4.578341   .0765306    4.428134    4.728548
         lab |  1009   4.686819   .0847027    4.520605    4.853032
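The overlap argument can be checked by hand from the reported means and standard errors, since a 95% confidence interval for a mean is approximately the mean plus or minus 1.96 standard errors. A pure-Python sketch, using the male and female BNP figures from the output above:

```python
# Checking the overlap of 95% confidence intervals from the reported
# means and standard errors (mean +/- 1.96 * SE).
def ci95(mean, se):
    return mean - 1.96 * se, mean + 1.96 * se

male_bnp = ci95(1.895487, 0.088887)
female_bnp = ci95(1.610679, 0.0721949)

# Intervals overlap when each lower bound sits below the other's upper bound
overlap = male_bnp[0] < female_bnp[1] and female_bnp[0] < male_bnp[1]
print(overlap)  # True: the gender gap could be sampling error
```

The male lower bound (about 1.72) sits just below the female upper bound (about 1.75), so the overlap is narrow but real, which is why the graphical difference should not be over-interpreted.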

Bivariate tests, along with other procedures such as confidence intervals, should hence
be used to explore the relationships between your variables. Once you have sufficiently
explored these relationships, you should move to a statistical procedure that will allow
you to estimate a particular form of relationship between two variables while controlling
for the effect of several other variables: linear regression.
One last word of caution before moving to regression models: remember that statistical significance is not substantive significance. More precisely, statistical significance is
neither necessary nor sufficient for substantive significance: some associations can be
statistically significant and yet devoid of substance, whereas others can be substantively
significant and yet too weak to pass independence tests. Type I and Type II errors remain
possible even after all possible tests, because you are working with probabilities,
confidence intervals and standard errors.
Consequently, your own capacity to interpret the data cannot be replaced by a statistical test. Instead, the tests should be used as heuristics: they should lead you to reflect
further on the quality and reliability of your data, and on your predictions regarding them.


11. Regression
While correlation is enough to establish and qualify a relationship between two variables, linear regression adds some predictive value to your analysis. In statistical analysis, prediction consists in identifying an equation that can predict approximate values
of a dependent variable from the values of independent variable(s).
Linear regression is a form of statistical modelling that predicts a dependent variable
from a linear, additive combination of one or more independent variables. In a simple
linear regression, there are two variables: one dependent (explained) and one independent
(explanatory). In multiple linear regression, there is still only one dependent variable,
but several independent variables.
As its name indicates, linear regression captures only linear relationships. If your variables are related in any other way, as in exponential or curvilinear relationships (think of
the Kuznets curve in environmental economics), your regression will reflect it only very
poorly (if at all). Variable transformation as presented in Section 9.3 might or might not
solve that issue; other techniques that reach beyond linear regression modelling can
then be used to better model quadratic or polynomial relationships.
A final word of caution: regression is not causation. While regression allows you to
identify predictive relationships between independent and dependent variables, it does
not allow you to identify a causal link between them. Only a substantive theory can
causally relate, for instance, income with life satisfaction, or gross domestic product
with low prevalence rates of HIV/AIDS. With linear regression, you will find coefficients
to relate both variables, but the causal link that might exist between them is a matter
of interpretation at that stage.
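What “finding coefficients to relate both variables” means can be sketched with the least-squares formulas for simple linear regression. The pure-Python illustration below (toy data, not from the guide) estimates the intercept and slope of the line that minimizes squared prediction errors:

```python
# A minimal sketch of simple linear regression: the line y = a + b*x
# that minimizes squared prediction errors (ordinary least squares).
def ols(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx          # the fitted line passes through the means
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]   # toy data, roughly y = 1 + 2x
a, b = ols(xs, ys)
print(round(a, 2), round(b, 2))   # 1.09 1.97
```

The slope b is the predicted change in y for a one-unit increase in x; nothing in the formula says which variable causes the other, which is exactly the point made above.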

11.1. Theory
The steps followed by regression modelling are fairly similar to those that you followed when you analysed the distribution of your variables:

−

Start by plotting your data to understand what you are working with. When working on a single variable, you used histograms and frequency tables (Section 9). With two variables or more, you will be looking at scatterplots and scatterplot matrixes to look for linear relationships.

−

Continue by summarizing your data using numerical measures. When working on a single variable, you used central tendency and dispersion. These measures are still relevant at that stage. You will add correlation measures to identify patterns between pairs of variables.

−

Finish by testing a model. When looking at a single variable, you tested the normality of the variable (Section 9.3). When looking at several variables using regression modelling, you will be testing the existence of a linear relationship between them.
Note that you will not be able to read regression results properly without a clear understanding of what the standard deviation is. You will also need to read p-values as well
as F and t statistics. This guide does not cover the full detail of the theory behind regression. Thus, you should turn to the corresponding handbook chapters before going
further with the analysis.
Similarly, prior to reading regression results per se, you will need to read some correlation coefficients, which are quite straightforward: Pearson’s r is a number ranging
from -1 to +1, with proximity to either -1 or +1 indicating a strong linear relationship; the correlation itself can be significant or not, and therefore also comes with a p-value.
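Pearson’s r is simple enough to compute by hand: the covariance of the two variables divided by the product of their standard deviations. A pure-Python sketch (toy data, not from the guide):

```python
# Pearson's r: covariance divided by the product of the variables'
# standard deviations; always lies between -1 and +1.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 2))   # perfectly linear
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 2))   # perfectly inverse
```

Note the symmetry: swapping the two variables leaves r unchanged, which is why correlation alone cannot tell you which variable drives the other.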
Finally, some caveats related to those mentioned in Section 10.1 apply:
−

Correlation is not causation. Correlation is symmetric: it tests for the strength of any relationship between two variables, and the relationship can theoretically go both ways. It is only by substantive interpretation that you can make an asymmetric causal claim, as in “age causes religiosity”, because the causal arrow between these two variables can go only one way (i.e. the counterfactual is impossible: religiosity cannot influence age, whereas it is possible for age to influence religiosity).

−

Beware especially of the ecological fallacy, which might wrongly lead you to make inferences about the individuals of your sample while looking at group-level data. For instance, the high life expectancy in France is a group-level characteristic that does not apply at the individual level: otherwise, speaking French while smoking and drinking would protect you from developing cardiovascular disease or cancer. Similarly, a U.S. Republican candidate can come first in poor U.S. states even if poor voters tend to vote for the Democratic candidate.

11.2. Assumptions
For the linear regression model to run properly, it should be applied to a continuous
dependent variable. If your dependent variable is categorical, you can still run your regression as long as the variable has a scale: you can typically treat an ordinal variable as
numeric (continuous).
All kinds of variables can be used as your dependent variable: gross domestic product, an electoral score, a scale of life satisfaction or level of education—even a binary outcome, such as “democratic or not” (0/1), can be submitted to regression analysis.
There are other assumptions and techniques that apply to regression but that we will
ignore, because the course is introductory and limited in scope:
−  We will not cover selection techniques (such as nested regression) that can be used to pick the right independent variables for your regression to reach its highest predictive value. Instead, we will stick with the independent variables that you chose by intuition and prior knowledge.
−  We will not go in depth into categorical regression models that apply specifically to binary outcomes (logit and probit), nominal dependent variables (multinomial), and so forth. Regression with categorical data can use very sophisticated models, taught in intermediate courses.
−  Finally, we will not run regression diagnostics [actually, we just might], which consist in studying the residuals of your regression model in order to validate the other assumptions of regression analysis.

The fact that we will not take these assumptions and techniques into account does not
invalidate your regressions straight away. Even if the predictive value and precision of
your final model could have been improved, your work will still yield interesting results.
Linear regression comes in many variants. Ordinary Least Squares (OLS) is only one estimation method; Generalized Least Squares extends it, and two-stage least squares (2SLS) regression is equally useful. [Explain these briefly. Handbooks rarely provide such an overview.] Another very useful technique, locally weighted regression, can be used to detect nonlinear relationships in scatterplots by fitting a locally weighted scatterplot smoothing (LOWESS) curve.
Example 11a. Foreign aid and corruption
This example uses several graph options and combines a linear fit with a quadratic fit and a LOWESS curve, in order to show the many problems that a simple linear regression can obscure (study: QOG, variables: ti_cpi and wdi_aid with some light recoding). The correlation between foreign aid distributed as development assistance and an index of corruption is significant (r = – 0.265, p < 0.01), but a graphical look at the linear fit shows that it poorly represents the data. Furthermore, the quadratic fit and LOWESS curve show a nonlinear relationship that we would miss with a simple linear regression. Finally, the scatterplot also reveals several outliers, like Singapore, Israel and Bangladesh.
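A minimal sketch of how such a graph can be drawn in Stata follows; the ccodealp label variable and the exact graph options are assumptions for illustration, not the author's original code:

```stata
* Scatterplot with linear, quadratic and LOWESS fits (sketch; options assumed)
tw (sc ti_cpi wdi_aid, mlab(ccodealp) msize(tiny)) ///
   (lfit ti_cpi wdi_aid) ///
   (qfit ti_cpi wdi_aid) ///
   (lowess ti_cpi wdi_aid, bw(1)), ///
   legend(order(2 "Linear fit" 3 "Quadratic fit" 4 "LOWESS"))
```

Overlaying the three fits in one twoway call makes the divergence between the linear fit and the smoothed curve immediately visible.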

[Figure: scatterplot of the Corruption Perceptions Index (y-axis, 2 to 10) against Net Development Assistance and Aid (Current Million USD, x-axis, 0 to 2,000), with observations labelled by country code and a linear fit and LOWESS curve overlaid. Sources: Transparency International, World Bank. Lowess curve bandwidth = 1.]

This example illustrates the many difficulties of regression modelling: even at the level
of two variables, providing a faithful account of a relationship can require complex
transformations or fairly advanced techniques. It can take a very long time to produce a
satisfying model, and even longer to produce a meaningful interpretation based on its
statistical results.
Once you also begin to consider measurement issues, omitted variable bias and possible variations over time, the analysis reaches a level of complexity that requires full-time specialization in quantitative methods.

11.3. Correlation
Correlation is the most straightforward way to test for independence between two continuous variables, by looking for a pattern in their covariance. Stata lets you build correlation matrixes, from which you can read correlation coefficients for any number of variables. Correlations can be partial or ‘semipartial’ and can be computed on different subsamples, and each method comes with strengths and weaknesses.
When running a correlation, use the pwcorr command to select all observations for
which the values of both variables are available. The ‘pw’ prefix in the name of the
command stands for pairwise deletion of missing data: it means that this command
will calculate each correlation coefficient by using all the observations for which the pair
of variables used by the calculation are available.
An important problem with pairwise deletion is that each correlation coefficient ends up
being calculated on a different part of the sample held by the dataset, i.e. on a different
subsample. This creates serious issues of external validity: it not only limits the ability
to compare correlation coefficients obtained under that method, but it also threatens
the possibility to generalize them to the population represented by the sample.
When building a correlation matrix, it is generally more reasonable to deal with missing
observations by excluding all observations for which any of the variables are missing.
This method, called casewise, or listwise, deletion of missing data, is implemented by
the corr command. Still, depending on your data structure, this method might result in
excluding a very large fraction of the observations contained in your dataset when calculating correlation coefficients, which again threatens the representativeness of your
sample.
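The difference between the two deletion methods can be sketched as follows; the variable names are illustrative picks from the QOG dataset, not a prescribed model:

```stata
* Pairwise deletion: each coefficient uses all observations available
* for that particular pair of variables
pwcorr ti_cpi wdi_aid undp_hdi, obs

* Listwise (casewise) deletion: only observations with no missing value
* on any of the three variables enter the matrix
corr ti_cpi wdi_aid undp_hdi
```

Comparing the observation counts between the two outputs shows how much of the sample each method discards.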
The problems outlined above are critical when your data contains many missing observations, which might excessively distort the correlation coefficients and limit generalization. There is no statistical solution to these issues because they emerge at the level of
data collection and might only be solved at that stage. An acceptable procedure consists in adding the obs option to the pwcorr command, in order to get the number of
observations used in calculating each correlation coefficient. Any important variation in
these numbers should be interpreted as a threat to external validity, which you should
take into account while interpreting the correlation matrix.
Add the sig option to your pwcorr command to obtain significance levels in your correlation matrix. For improved reading, use the star(.05) option to add a star next to statistically significant correlations at p < .05. The strength at which you should start considering a correlation is a substantive question that depends on your research design, but a
value of 0.5 is usually a good start to identify strong correlations, and a value of 0.25
might identify moderate correlations.
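Combined, the options above give a compact exploratory matrix; the variable names below are again illustrative QOG picks:

```stata
* Correlation matrix with observation counts, p-values and significance stars
pwcorr ti_cpi wdi_aid undp_hdi, obs sig star(.05)
```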
Due to their multiple issues of validity, you should refrain from drawing strong inferences from correlation matrixes. Use correlation coefficients for explorative purposes,
to refine your intuitions and hypotheses. Once you have understood Pearson’s r and
the explorative potential of correlation matrixes, more robust results will be provided by
linear regression.
Example 11b. Trust in institutions
Trust is a common measurement in social surveys that usually applies either to people in general or to specific social and political institutions like parliaments, politicians or the police force. We decided to focus on institutional trust (study: ESS, variables: all starting with trst, selected with the wildcard trst*) to illustrate how trust correlates between institutions:

. pwcorr trst*, star(.05)

             trstprl  trstlgl  trstplc  trstplt  trstprt   trstep   trstun
    trstprl   1.0000
    trstlgl   0.6674*  1.0000
    trstplc   0.5527*  0.6813*  1.0000
    trstplt   0.7005*  0.5745*  0.5051*  1.0000
    trstprt   0.6657*  0.5499*  0.4667*  0.8691*  1.0000
     trstep   0.4776*  0.4240*  0.3576*  0.5307*  0.5418*  1.0000
     trstun   0.4441*  0.4354*  0.4199*  0.4888*  0.4961*  0.7175*  1.0000

The correlation matrix reproduced above shows moderate-to-strong correlations between many institutions. For instance, there is a strong association between the trust scores of two supranational organizations, the European Parliament and the United Nations (r = 0.72). That association is actually weaker between the European Parliament and national parliaments (r = 0.48). Similarly, there is a remarkably strong correlation between trust scores for politicians and for political parties (r = 0.87), but the associations between these two scores and the trust score for the legal system are less marked, despite the importance of the rule of law in guaranteeing democratic electoral competition.
All observations drawn from correlational analysis are tentative. The most robust findings actually come from the absence of any significant correlation, which can designate mutually exclusive situations, as in a measure of voting preference for several political candidates: if constitutional rules organise a single-winner (uninominal) ballot, then the election is a zero-sum game between the candidates, and voters are likely to polarise their opinions and reject all candidates but one.
However, when the correlation matrix shows very strong associations between two or more independent variables, you can start diagnosing potential issues of multicollinearity in your future regression model. Multicollinearity is the situation where independent variables influence each other in a significant way that is not captured in your model, where the focus is instead set on the dependent variable. Section 11.5 covers multicollinearity in more detail.

11.4. Interpretation
Linear regression in Stata uses the regress command, followed by two variables for a
simple linear regression and any number of variables for a multiple linear regression.
The list of variables depends on your research design and on the results of your previous bivariate tests.
To interpret a linear regression model correctly, you should use robust standard errors (using the robust option) and then focus on the following results:

−  The number of observations reflects the subsample used to perform the regression. This subsample is created by casewise deletion, as explained above. A low number of observations will limit the validity of the model. If you are facing a very low number of observations, you will need to remove the variables that are causing that number to drop, to increase the validity of the model for your sample population. If one of your independent variables is available for less than 30 observations, remove it and run your regression without it.
−  The p-value of the model (Prob > F) should be below your alpha level of significance. The separate p-values for the coefficients in your model should obey the same rule. The p-values can be read independently: if a categorical variable returns both high and low p-values on its dummies, interpret them separately. For instance, if your model includes a variable that defines religious denomination, it might happen that the variable produces a significant effect only for some religions and not others.
−  The R-squared statistic indicates the predictive value of your model. It can be read as a percentage: an R-squared value of .08 indicates that your model predicts the variance of your dependent variable by only 8% (whereas efficient models will usually predict over 80%). An issue with the R-squared statistic is that it will mechanically increase with the number of variables included in the regression model. To control for that effect, use the adjusted R-squared, which penalizes models that include a large number of independent variables. This issue disappears with standardised coefficients, as explained below.
−  Finally, the coefficients for both your variables and your constant (noted _cons and also known as the intercept) are the parts of the model that you will interpret. To make them comparable, you need to standardise them across variables, using the beta option. Technically, the coefficients establish the amount of variation in your dependent variable that occurs for a variation of one unit in each of your independent variables. This variation can be interpreted straightforwardly only for continuous data, and requires more thought for categorical data.
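As a sketch, using dv and iv placeholders for your own variables:

```stata
* Robust standard errors (for inference on the coefficients)
reg dv iv1 iv2, robust

* Standardised ‘beta’ coefficients (to compare effect sizes across variables)
reg dv iv1 iv2, beta
```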

[Example suggested by Dawn Teele] AJR's Colonial Origins of Comparative Development, p. 1378: "To get a sense of the magnitude of the effect of institutions on performance, let us compare two countries, Nigeria, which has approximately the 25th percentile of the institutional measure in this sample, 5.6, and Chile, which has approximately the 75th percentile of the institutions index, 7.8. The estimate in column (1), 0.52, indicates that there should be on average a 1.14-log-point difference between the log GDPs of the corresponding countries (or approximately a 3-fold difference: e^1.14 ≈ 3.1). In practice, this GDP gap is 253 log points (approximately a 12-fold difference). Therefore, if the effect estimated in Table 2 were causal, it would imply a fairly large effect of institutions on performance, but still much less than the actual income gap between Nigeria and Chile."
Example 11c. Subjective happiness
The Quality of Government dataset includes a series of indicators that report self-assessed happiness. One of these indicators consists in a mixed measure that combines life satisfaction and a subjective assessment of one’s life, from “best possible” to “worst possible”. Both measures use standardised psychometric scales that provide an indicator of individual happiness in the 0–1 range, multiplied by life expectancy at birth (study: QOG, variable: wdh_lsbw95_05, renamed life for convenience).
The steps that we took prior to running the linear regression model include data preparation, description and visualization. Each step is covered, with comments, in the
qog_reg.do file provided in Appendix A. We then thought of a list of potential predictors for this dependent variable, which can be summarised along the following categories:
−  Security, measured as the absence of social, economic or political crisis
−  Wealth, measured as national affluence, free markets and low corruption
−  Freedom, measured as free speech, gender equality and democratic life
−  Health, measured as high life expectancy and low infant mortality
−  Education, measured as high educational attainment among both sexes

Looking at the variables in the Quality of Government dataset, we found many variables that could fit each part of the model, which is far from perfect—for example, the ‘Health’ component is partly redundant (or collinear) with the dependent variable, since the happiness indicator is already calculated against life expectancy. Similarly, infant mortality, life expectancy and educational attainment (measured as average schooling years) are heavily correlated, which would lead to measuring the same factor twice; and since corruption is an obstacle to free markets, it is likely to be measured twice if we include both as separate independent variables. Consequently, we removed some variables from the model after looking at a few correlations.
The final model tests three series of variables, corresponding to our three models of happiness as Security, Wealth and Freedom. The table below summarises the respective results of the models; as shown, each of them carries statistically significant results that can be substantively interpreted. Improvements to the model would include normalizing some of the variables and applying some other diagnostics, quickly reviewed in the next section.


Table 1. Estimated Effects of Security, Freedom and Wealth on Subjective Happiness

                             Model 1    Model 2    Model 3    Model      Model
                             Security   Wealth     Freedom    1+2        1+2+3
Failed state                 –0.69***                         0.02       –0.05
                             (0.03)                           (0.05)     (0.05)
Gross domestic product                  0.39***               0.40**     0.39**
                                        (0.00)                (0.40)     (0.00)
Market governance                       0.45***               0.47***    0.46***
                                        (2.37)                (2.38)     (2.26)
Freedom of speech                                  0.28**                0.25**
                                                   (2.38)                (1.91)
Freedom of the press                               –0.25                 0.39**
                                                   (0.11)                (0.09)
Women’s social rights                              0.31**                0.00
                                                   (1.35)                (1.10)
Electoral process                                  0.30                  0.52**
                                                   (0.91)                (0.77)
Political process                                  –0.39                 –0.31
                                                   (0.81)                (0.66)
Constant (or “Intercept”)    63.16***   35.93***   37.49**    35.23***   17.00*
                             (1.65)     (1.62)     (11.56)    (4.32)     (9.03)
Observations (or just “N”)   93         91         94         90         90
R-squared                    0.48       0.66       0.45       0.66       0.73

Standardised beta coefficients; robust standard errors in parentheses.
Significance levels: * significant at 10%, ** significant at 5%, *** significant at 1%.

[WRITE: Dummy variables] We might finally want to add a dummy variable for Western countries to see whether Western democracy really has an advantage over the rest of the world in terms of the subjective happiness of its citizens. The simplest way to do this consists in using the xi: reg command with i.ht_region, but the tab ht_region, gen(region_) command will also work.
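A sketch of both approaches, with the other independent variables omitted for brevity (ht_region is the QOG region classification):

```stata
* Factor-style region dummies (the xi: prefix is needed on Stata 10 and below)
xi: reg life i.ht_region, robust

* Equivalent: create explicit region_1, region_2, ... dummies first;
* Stata automatically drops one collinear dummy as the reference group
tab ht_region, gen(region_)
reg life region_*, robust
```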
[WRITE: Interactions]

11.5. Diagnostics
Linear regression models have the capacity to reveal many different aspects of your data, such as multicollinearity, when several independent variables in the model all revolve around the same factor and thus lead to estimating the same factor several times through different measurements. For instance, if you include monthly and annual income in your model, you are controlling twice for a single factor, measured in two different ways. Multicollinearity will affect your model by calculating separate coefficients for “independent” variables that are actually correlated, hence creating an issue in your model.
Variance inflation factors (VIF), obtained with the vif command, allow you to assess multicollinearity. As a rule of thumb, the factors should stay below 10 (or, alternatively, their tolerance should stay above 0.1). If your VIF diagnosis finds multicollinearity, your independent variables include some collinear ones, which will affect the measure of the regression coefficients.
Another issue that might affect your regression model is heteroscedasticity, which designates a violation of one of the assumptions under which linear regression operates. Specifically, linear regression posits homoscedasticity, which stands for equal (or constant) variance of the residuals across the values of your independent variables. This is an important dimension of any regression model.
Under equal variance, a plot of your regression residuals should show no pattern
among them. If a nonlinear pattern appears, or if the distribution of the residuals
around the fitted values is simply not close to uniformity, then the data violates the linear assumption of your model. The rvfplot command diagnoses that issue by producing
a plot of residuals-versus-fitted values.
Finally, some formal tests exist to detect heteroscedasticity, such as the hettest and imtest commands (estat hettest and estat imtest in recent versions of Stata). [WRITE]
Each issue will have different consequences on your linear regression model. Multicollinearity will increase the standard error of your coefficients and obscure the interpretation of the results. Heteroscedasticity indicates that a linear model is not reflecting the
data correctly, and that a nonlinear model should be used.
At that stage, you should also detect influential observations using a measure called
Cook’s D, and outliers, using tools akin to those described in Section 9.4. The studentized residuals [WRITE]
[EXAMPLE CONTINUING 11.4]
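Pending that example, a generic diagnostics sketch follows; the model and variable names are placeholders, not the course model:

```stata
* Run the model, then diagnose it (iv1, iv2 are placeholders)
reg life iv1 iv2
vif                  // variance inflation factors (rule of thumb: below 10)
rvfplot, yline(0)    // residuals-versus-fitted plot, to spot patterns
hettest              // Breusch-Pagan test for heteroscedasticity
predict d, cooksd    // Cook's D, to flag influential observations
```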


12. Cheat sheet
This section summarises the most useful commands used during the course and in this
guide to produce the kind of statistical analysis that is expected to appear in your final
research paper. Part 3 will further explain how to write your draft and final papers.

12.1. Theory
Colour codes: You need to understand these notions for Assignment No. 1; You need
to understand these notions for Assignment No. 2; You need to understand these notions for the final paper. Each topic is featured at various places in the Stata Guide, and
the theoretical notions are explained in depth in your handbook.
−  Datasets: survey design, sampling strategy, sample size, units of observation, variables, missing observations, categorical (ordinal, interval, nominal) and continuous (ratio, count) variables.
−  Normal distribution: standard error (of the mean), skewness and kurtosis, probability distribution, (standardized) z-scores and ‘alpha’ levels of precision, other distributions (binomial, Poisson).
−  Univariate statistics: number of observations, mean, median, mode, range, standard deviation, percentiles, quartiles, graphs (histograms, kernel density, bar, dot and box plots).
−  Estimation and inference: point estimates, confidence intervals, t distribution, null and alternative hypothesis testing, p-values, one-sided and two-sided/tailed tests, Type I (false positive) and Type II (false negative) errors.
−  Bivariate tests: t-tests (means comparison), proportions tests, Chi-squared tests and Cramér’s V (independence or association), correlation (Pearson and Spearman), Gamma test.
−  Regression: simple and multiple linear regression, unstandardized and standardized ‘beta’ coefficients, F-statistic, t- and p-values, R-squared, dummies, interactions, diagnostics (residuals, multicollinearity, homoscedasticity, influence).
−  Statistical issues: sampling frames, survey weights, variable measurement and bias, normality assumptions, correlational ≠ causal analysis, statistical ≠ substantive significance.
−  Scientific writing: research design, paper structure, scientific style, tables and graph formatting, referencing (sources and citations), discussion.


12.2. Data
Data management is covered in Sections 5–8. Transforming variables needs to be done carefully, and takes a lot of time, especially when you are new to the data that you are using.
−  Use the datasets that we recommend for this course. If you do not have a dataset ready well in advance for Assignment No. 1, fall back on ESS, GSS, WVS and QOG data to find variables of interest.
−  Your dependent variable is a single, continuous measurement, such as the average Body Mass Index of American adults or the percentage of women in national parliaments around the world.
−  Select a handful of ‘IVs’ of any type, such as age, gender, GDP per capita or welfare regimes, that you think can explain the distribution of your ‘DV’.
−  Never save over a clean dataset. Do not save your changes to the data! This makes them impossible to retrace properly. Instead, keep all the information needed to replicate your changes in the do-file.
−  drop and keep (to select observations and variables)
−  generate (to create new variables like sums or indices)
−  rename and label variable (to rename your variables with convenient names that are short and understandable, and to assign them labels)
−  recode and replace (to recode data into simpler categories, such as dichotomous variables with binary outcomes)
−  encode and destring (to convert string data into numeric format)
−  label define (to create value labels), followed by label values (to assign labels to the values taken by your variables)

Remember: poorly prepared datasets are much more complex to analyse, and even
harder to understand for others. As a rule of thumb, try to apply these principles:
−  Renaming your variables to humanly understandable words or acronyms is very helpful, as long as you keep the new names short and to the point;
−  Proper labels should be assigned to both variables and values, so as to make sense of categories, cut-off points and so on;
−  Survey weights should be documented, even if we do not use them extensively. Your do-file should include the svyset command documented in the ‘readme’ text file distributed with each course dataset.
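For instance, a minimal preparation block might look like this; the dataset path and variable names are hypothetical:

```stata
* Load, trim and document a dataset (names are hypothetical)
use data/qog, clear
keep ccodealp ti_cpi wdi_aid
rename ti_cpi corruption
label variable corruption "Corruption Perceptions Index"
recode wdi_aid (min/0 = .), gen(aid)   // example recode: negative flows to missing
```

Keeping all such steps in the do-file, rather than saving over the data, makes every change replicable.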


12.3. Distributions
Start by describing your variables as shown in Section 9. Univariate descriptions of your data will appear in your table of summary statistics. Some commands follow:
−  tab, tab1 or fre (to tabulate categorical variables; the additional command fre is recommended because of its better handling of missing observations)
−  summarize (to produce a five-number summary for continuous data: number of observations, mean, standard deviation, min and max)
−  summarize with the detail option (to further summarize continuous data with quartiles and percentiles as well as skewness and kurtosis)
−  tabstat with the n mean sd min p25 p50 p75 max options (to further summarize continuous data with quartiles); alternative command: univar.

Continue by graphing the distribution of your variables when you are able to comment
meaningfully on the distribution in your paper:
−  graph hbox (to describe the distribution of continuous data, showing quartiles and outliers)
−  histogram with the kdensity and normal options (to describe the distribution of continuous data; use other units if relevant)
−  histogram with the discrete option (to describe the distribution of categorical data that support an interval or ordinal scale)
−  catplot (to describe the distribution of categorical data; this additional command is rarely more useful than a frequency table obtained with tab1 or fre)

Remember: meaningful labels (usually percentages) will make your graphs much more informative, so use xtitle and ytitle to assign titles to axes, ylabel to modify the units and labelling of the y-axis, and specify frequency or percent to use these units of measurement instead of density in histograms.
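Put together, a labelled histogram call might look like this (the variable name is a placeholder):

```stata
* Histogram in percent, with normal overlay and labelled axes
hist corruption, percent normal ///
    xtitle("Corruption Perceptions Index") ///
    ytitle("Percentage of countries")
```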
Finally, since independence tests are generally based on the presumption that your variables are normally distributed, you should test this distributional assumption by looking
for possible transformations of your variable, using diagnostic plots:
−  symplot (to assess symmetry)
−  qnorm and pnorm (to assess normality at the tails and at the centre)
−  summarize with the detail option (to obtain skewness and kurtosis: zero skewness denotes a symmetrical distribution, and normal kurtosis is close to 3)

A variable might approach normality when transformed to its logarithm (log or ln), or sometimes its square root (sqrt) or square (^2). Only a full, visual check using the commands above can determine the relevance of a transformation.

12.4. Association
Your paper is articulated around rejecting the null hypothesis about the absence of association between your dependent and independent variables. There are many tests
available to do so, but you will use a short list of independence tests:
−  ttest with the by option (to compare two means, using continuous data)
−  prtest with the by option (to compare two proportions, using discrete data)
−  tab with the chi2 option (to crosstabulate categorical data)
−  tab with the exact option (to crosstabulate on low cell counts or ‘2 x 2’ tables)
−  pwcorr with the obs, sig and star options (to build a matrix of significant correlations at a certain level of significance)

The tests do not use the same method: comparison tests use confidence intervals, as
also provided by the ci or prop commands.
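A sketch of the short list above, using ESS-style variable names assumed for illustration:

```stata
* Means comparison across a binary group
ttest trstprl, by(gndr)

* Crosstabulation with a Chi-squared test
tab gndr vote, col chi2

* Significant correlations, with observation counts and p-values
pwcorr trstprl trstep trstun, obs sig star(.05)
```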
You should represent significant relationships graphically:
−  sc with two continuous variables; use the mlab(country) option to label data points with the values of the country variable, as you might want to do with country-level data or other data with identifiable observations.
−  spineplot for two categorical variables; make sure that you install this additional command and train yourself to read it correctly, as it is arguably the most useful way to plot two categorical variables against each other.
−  gr dot or gr hbar for a continuous dependent variable and a categorical independent variable; do not publish these graphs unless you have solid grounds to think that a categorical plot can convey more information than a table.

12.5. Regression
You can use linear regression as long as your dependent variable is continuous. If your
dependent variable represents a binary outcome, another model applies, using the logit
command. The most useful commands for linear regression are:
−  sc with two continuous variables (to visualize a possible linear correlation)
−  tw (lfitci dv iv) (sc dv iv) with two continuous variables (to visualize correlations with their linear fit and confidence intervals)
−  reg dv iv1 iv2, … with 2+ continuous variables (to run simple or multiple linear regression models)
−  reg dv i.dummy (to add dummy categorical independent variables; add xi: in front of the command if running Stata 10 or below)

−  char dummy[omit] value (to set the baseline, or reference group, for dummy categorical variables; a recode is usually simpler/better, though)

Once the model has run, focus on the following results:

−  the number of observations (the data on which the model ran)
−  the t-values and p-values (whether the model is statistically significant)
−  the R-squared (a gross measure of the predictive value of the model)
−  the coefficients (the amount of variance predicted by one unit of the predictor)
−  the standard errors, t-values, p-values and confidence intervals (the reliability of the coefficients produced by the model)

12.6. Programming
Remember some very basic Stata operating procedures:
−  Setting the working directory is not an option. For this course, always check that you have selected the SRQM folder as your working directory.
−  Remember to install additional packages like fre, spineplot or tabout, or you will run into errors when calling these commands.
−  The commands of a do-file should be run in sequential order: if you try to execute line #30 before line #10, you are likely to encounter an error.
−  Get used to running multiple commands at once: select them and use Ctrl-D on Windows or Cmd-Shift-D on Mac OS X to run them altogether.

The following tricks are for readers who have little or no experience with programming languages. Skip them if you are acquainted with programming environments.
−

To execute a command and ignore breaks in case of an error, use the capture
prefix command (shorthand cap), as in cap drop age (which will drop the age
variable and just do nothing if the variable does not exist).
This option will come in handy with lines such as cap log close, which will return an error if no log file is open. Using cap is then a safety net to ensure that
the do-file will run regardless of a log file being open.

− To break long commands onto several lines, use the /// continuation marks, or set a line delimiter with the #delimit (or just #d) command; for instance, type #d ; to write in pseudo-C++ syntax. Lines connected with /// must be run together.
The first option is very useful when you are dealing with code that extends beyond the limits of your do-file editor window, which will happen often if you are coding graphs with many options.
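As an illustration, both forms might look as follows (the variables and titles are hypothetical):

```stata
* Break one long graph command over several lines with ///.
sc births schooling, ///
    ti("Fertility and education") ///
    yti("Fertility rate") xti("Years of schooling")

* The same command with a semicolon delimiter, in pseudo-C++ syntax.
#d ;
sc births schooling,
    ti("Fertility and education")
    yti("Fertility rate") xti("Years of schooling") ;
#d cr
```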

− To apply a command to several variables at once, use the * wildcard symbol, as in su trst* to summarize all variables starting with the trst prefix, or destring party*vote to convert all variables named party1vote, party2vote, and so on.
This trick comes in handy if you have created binary variables from a categorical one. For instance, after the tab relig, gen(relig_) command creates dummies of religious beliefs, the relig_1, relig_2, … relig_n variables can be designated together by typing, for example, ci relig_*.
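Put together, that dummy workflow reads as follows (relig stands for a hypothetical categorical variable):

```stata
* Create one dummy variable per category of relig.
tab relig, gen(relig_)

* Designate all the resulting dummies at once with the wildcard.
ci relig_*
```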
− To use loops, read the foreach and while documentation. Stata code is Turing complete, so it can handle whatever you need. Use loops if you know how they generally function in programming environments; ignore them otherwise.
The simplest use of a foreach loop is when you are recoding a bunch of variables that all follow the same coding rules. In that case, you can place the recoding operations inside a foreach loop to save time and code.
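For instance, a hypothetical sketch, assuming a set of trust items coded 0–10 where any value above 10 is a missing code:

```stata
* Recode the missing codes of all trust items in one go.
foreach v of varlist trst* {
    replace `v' = . if `v' > 10
}
```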

− To use ‘macros’, read the local and global macro help pages. Use macros only if you have sufficient programming experience in another language where you already learned to use macros, constants or scalars.
A basic example of a global macro is one that stores recurrent graph options, which you can then call in only one word. A basic example of a local macro involves counters, which should not come up in this course.

Examples of each trick will show up in portions of the course do-files that are not
shown in class, but included for you to explore while replicating the session. None of
these tricks are part of the course requirements.
*
My tentative conclusion to the course follows.
For empirical social scientists, statistical reasoning and quantitative methods provide an
additional layer of theory and methods to their sociological skill set, which also includes
a fair share of history, philosophy and various forms of intuition.
This layer is meant to enhance their capacity to make causal claims about observable
relationships in the social, material world. Just like any compound of theory and methods, the ‘SRQM’ layer is a thick one that is not easily digested: in fact, you will have to
ruminate it a lot, in the same way that you should be feeding on data and secondary
analysis to ground your own work.
You can be an intellectual fox or hedgehog, as Isaiah Berlin once suggested, but if you want to be an empiricist, you will have to be a ruminant too. Please do not be offended when I say that this guide is actually a cookbook for hungry sociological herbivores.
And remember: your input to improve this guide is needed!


Part 3

Projects
The Assignments for this course will lead you towards submitting your final paper, in which you will expose your research project. During the weeks of class, you will write up two Assignments and one final paper, following instructions that are carefully described in this section. Read these in detail.
Each Assignment is graded as a step towards producing your final paper. The grading ranges from 1 (critical revisions needed) to 5 (cosmetic revisions needed).
The table below shows how Assignments progressively lead to your final paper.
Assignment 1

1. Research question: Identify a research topic, a dataset, and select some variables.
2. Univariate tests: Identify the values and distributions of the variables used in the research.

Assignment 2

Revise and update Assignment No. 1 before moving to steps 3 and 4.
3. Bivariate tests: Crosstabulate the variables and run significance tests for the crosstabs.
4. Regressions: Run regression models, report their output, and discuss the results.


13. Formatting
Section almost entirely consistent with the course in its current form. Review. Notably, Tables and Graphs could be presented in another division, “Exports” and
“Presentation”.
Applying the formatting conventions below will ensure that your paper uses a presentational standard that comes close to academic practice. Do not consider your paper an academic one without some attention to presentation, which is necessary to help the reader.
Intelligent formatting will require exporting, styling and commenting on tables and graphs. This will make up a substantial fraction of your word limit, and of your paper overall. If followed, these recommendations will hence help you build your paper methodically, which also helps with reasoning. QED.

13.1. Communication
When sending your Assignments, you must carefully respect some conventions that
apply to your email and attached files. Your Assignment emails should be structured as
in the following example:
From
To                     Always email to both course instructors.
Subject                SRQM: Assignment No. 1, Briatte and Petev
Body
Attachments
  Assignment           BriattePetev_1.pdf
  Do-file              BriattePetev_1.do
  Dataset (optional)   Attach your data in DTA format if you chose to work with a dataset outside of the recommended ones.

13.2. Files
Your Assignment is a text file written following a set of scientific conventions. Follow
the instructions provided in class to format your document, and print it as a PDF file
using any common utility to do so.
Your do-file must be executable: try running it to make sure that it does not produce errors. Use the template provided in class to do so, and do not hesitate to imitate the do-files from the course sessions.
Your dataset must be provided for replication purposes: if it does not feature in the datasets recommended for the course, send it along as a Stata data file with a ‘.dta’ extension, converting it to that format if needed (Section 5), and making sure that it is
ready for cross-sectional analysis (Section 7).
Important: if your dataset exceeds ~ 5MB, compress it as a ZIP or RAR file, without any form of password encryption. If your data still exceeds ~ 10MB after subsetting and compression, use sendspace.com to email it over. Please do not use any other format or service, to avoid unnecessary confusion.

13.3. Text
Your Assignment is not a string of Stata output pasted into a text file. The last sections
of this guide (and Section 16 in particular) set out instructions to format your final paper as a scientific one, but more generally, remember that academic writing involves
using precise, unambiguous terms in correct, simple sentences. Beyond issues of vocabulary, grammar and syntax, you should also use a text structure made of a small number of balanced paragraphs and sections.
Data and procedures that come with no explanation are useless to the reader. In your
work, tabular and graphical data visualizations (tables and figures) should be supported
by substantive text, including a title, a legend and some explanatory notes, either in
your main text or as captions. More on this below.
In many ways, the same recommendations apply to the code in your do-file. As mentioned in Section 2, computer languages work a lot like human languages. For instance, linguistic diversity also applies to programming languages: Stata code is only one of thousands of different languages, each of which possesses its own syntax rules; they end up forming ‘families’ of languages, with shared properties and a more or less dynamic community of more or less proficient users, and so on.
Code is fundamentally text, and computer code obeys common linguistic rules. For example, you have noticed that Stata code supports abbreviations, with some commands coming in two forms (a ‘full form’, such as summarize, and a shorthand form, such as su). Pushing that observation further implies that there is such a thing as writing ‘clear’ code, just like there is ‘clear’ writing.
This notion is often called ‘literate programming’, as computer scientist Donald Knuth termed it in 1984. Knuth proposed to “change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

Quoting again from Knuth, that mind-set carries one important lesson: “The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style.” Broken into simple rules that fit this course, the three most important ideas of ‘literate programming’ that apply to your work and will convey sense to your code are the following:
− Syntax and vocabulary are (almost) inflexible. Computer code is less versatile than human language, and deviations in the syntax or terms that you use will generate errors. Eliminating these errors is the first aspect of ‘debugging’ a program, that is, correcting its content to make it behave properly. Your code should be flawless by that standard.
Section 2.6 explains how to deal with a command that returns an error caused by faulty syntax or by a spelling mistake. More complex aspects of debugging involve, for example, the precise order in which you execute your commands. Hopefully, the linear structure of Stata code should reduce issues in that category to a minimum.

− Complexity calls for annotations. As soon as we express difficult thoughts, we add a ‘meta’ layer of information to our text, as with side notes in religious texts like the Talmud, stage directions in theatre plays, or bracketed information and footnotes in other kinds of texts.
In computer code, comments serve precisely the same elucidative purpose. They should document the commands that you use, so that a reader can distinguish which parts of your code do what.
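In Stata do-files, such annotations take three forms, all of which you can mix freely:

```stata
* A comment on its own line, often used as a section header.
su age // a short comment at the end of a command

/* A longer note that spans
   several lines of text. */
```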

− Sectioning is not an option. Cartesian thought has this in common with playwriting: both proceed in blocks, like acts and scenes. Think of philosopher Ludwig Wittgenstein’s Tractatus Logico-Philosophicus, which uses hierarchical numbering to outline seven propositions, or just think about the structure of any book or newspaper, with page numbers, paragraph spacing, text columns, sub-heads and running titles.
Your code should also feature a simple sectioning, with comments and blank lines used to group commands into relevant conceptual blocks, such as setup, recoding, descriptive statistics, analysis with a first independent variable, and so on.

13.4. Tables
Quantitative evidence requires producing tables of summary statistics for many operations such as variable description, correlation or regression modelling. Stata output can be exported by copying it from the Results window, using the ‘Copy as Picture’ function. This method might be acceptable for presenting draft texts, but not for final written communication.
There are two main reasons for this, beyond aesthetics:
− First, Stata output usually contains more information than required for a standard paper. Scientific communication is parsimonious and only requires a selection of summary statistics, while Stata is more exhaustive and produces larger amounts of output.
− Second, Stata output reports a high level of precision by including many digits after the decimal point, which is inconsistent with the limited precision of initial measurements. For example, the height variable of the NHIS dataset is reported by the su command as having a mean of 66.68131 inches, a precision level that does not reflect real measures.

It is therefore expected that you do not copy and paste from your Stata output to report your results, but instead convert it to tables that follow some standard formatting conventions:
− Round all results to one or two decimals. After exporting your results, use a rounding function to truncate the numbers.

− Ideally, format your table so as to align columns on decimal tabs. Centring your tables on the decimal separator “.” helps with reading your results. Unfortunately, not all word processors manage to do this well.

− Ideally, provide notes with your table. Your table can use footnotes to comment on its contents. These comments can either appear in your text, or better, as footnotes immediately below the table.

Because word processors are variably competent or explicit about the latter two conventions, they are only optional here. However, for the most dedicated, brief instructions are available at http://people.oregonstate.edu/~acock/tables/center.pdf.
Start by exporting your tables using the method described below, which deals first with
continuous variables, and then with categorical ones:
− First, install the tabout command, then run these commands, replacing dv, iv1, iv2 and iv3 with your dependent and independent variables, as long as they are continuous:
* Produce a standard summary statistics table.
tabstat dv iv1 iv2 iv3, s(n mean sd min max) c(s)
* Export to CSV file.
tabstatout dv iv1 iv2 iv3, tf(stats1.csv) s(n mean sd min max) c(s) f(%9.2fc) replace
The CSV file, which will require that you import it in a spreadsheet editor like
Microsoft Excel, contains a table of summary statistics that can be easily imported or copied and pasted into your text processor.
− Second, run this command on your categorical variables. The command will export a frequency table in percentages:
* Export to CSV file.
tabout iv4 iv5 iv6 using stats2.csv, replace

The files stats1.csv and stats2.csv should have been created in your working directory (Section 3.3), and can be imported or copied and pasted into a word processor. Both files are just working files and need not be sent with your do-file.
A useful way to compact both files into one, in order to produce only one table, is to
stick the frequency percentages of your categorical variables into the same column as
the mean values of your continuous ones. An example of that arrangement is shown
below in Table 3.
The following example describes a dataset and its main variables of interest for its
units of observation, N = 10 African countries, based on a recent journal article: Kim Yi
Dionne, “The Role of Executive Time Horizons in State Response to AIDS in Africa”,
Comparative Political Studies 44(1): 55–77, 2011.
Table 1. Countries of analysis

Country       HIV prevalence (2001)   GDP (2001)
Ethiopia       4.1                     724.80
Mozambique    12.1                    1018.82
Rwanda         5.1                    1182.49
Zambia        16.7                     816.83
Tanzania       9                       547.27
Burundi        6.2                     618.10
Uganda         5.1                    1336.33
Lesotho       29.6                    2320.70
Namibia       21.3                    6274.3
Kenya          8                      1016.18

Note: inspired from Yi Dionne (2011).

The next examples describe the summary statistics for the Quality of Government dataset. By convention, “N” designates the total number of observations, and “SD” stands for the standard deviation.


Table 2. Summary statistics

Variable                                       N     mean    SD      min    median  max
Infant mortality rate (per 1000 live births)   181   43.30   39.97   2.80   27      166
GDP per capita (logged) a                      192   7.57    1.56    4.58   7.44    11.20
Government health expenditure (% GDP)          76    84.67   715.84  0.01   1.93    6243.05

Note: data from Quality of Government (2010).
a. Variable transformed to natural logarithmic scale.

If you need to include categorical variables, use the “Mean” column to indicate the valid percentages for each category, as follows:
Table 3. Summary statistics

Variable                                       N     mean    SD      min    median  max
Infant mortality rate (per 1000 live births)   181   43.30   39.97   2.80   27      166
Government health expenditure (% GDP)          76    84.67   715.84  0.01   1.93    6243.05
GDP per capita (logged) a                      192   7.57    1.56    4.58   7.44    11.20
Regime type
  Monarchy b                                   13    6.99
  Military                                     12    6.45
  One-party                                    7     3.76
  Multi-party                                  56    30.11
  Democracy                                    89    47.85

Note: data from Quality of Government (2010).
a. Variable previously transformed to natural logarithmic scale on grounds of normality.
b. Includes only nondemocratic monarchies; cf. QOG Codebook (2011), p. 34.

The next example describes crosstabular output. By convention, independent variables
are displayed in rows, with column percentages. The example below provides both frequencies and percentages, but it is common not to indicate the frequencies when these
reach high counts.
As you will notice, the table is not very helpful, as it uses the recoded versions of two
continuous variables. When the data are available as continuous variables, scatterplots
are always preferable to crosstabulations, especially when displayed with the linear fit of
a simple linear regression.


Table 4. Crosstabulation of GDP and HIV

                         HIV prevalence
                  Low          Medium        High          Total
GDP per capita    n     %      n     %       n     %       n     %
Low               1    25.0    3    37.5     0     0.0     4    26.6
Medium            3    75.0    3    37.5     1    33.3     7    46.6
High              0     0.0    2    25.0     2    66.6     4    26.6
Total             4   100.0    8   100.0     3   100.0    15   100.0

Note: adapted from Yi Dionne (2011), replication dataset.

The final example describes multiple linear regression output. The important figures to include are the coefficients, standard errors, the constant (or intercept), the total number of observations (N) and the R-squared, as well as the starred p-values. The table below reports multiple linear regressions for three dependent variables, using three levels of significance (0.1, 0.05 and 0.01).

Table 5. Estimated effects of HIV rates and GDP on API policy scores

                              API Policy    Health Spending   AIDS Spending
Log HIV prevalence              7.89            5.77              1.67
                              (10.03)          (4.25)            (1.28)
Log GDP per capita             -7.59           -2.51              1.97*
                               (8.04)          (3.41)            (0.97)
Constant (or “Intercept”)      96.97***       12.22            -21.48***
                              (21.75)         (9.22)            (2.7)
Observations (or just “N”)     15              15                14
R-squared (or fancily “R²”)    0.08            0.13              0.52

Note: adapted from Yi Dionne (2011), Table 3. Each column corresponds to a different model, and starts with the name of the dependent variable. Standard errors in parentheses.
* p < .1. ** p < .05. *** p < .01.

As indicated above, it is good practice to add a table summary after your regression
output. The summary must report whether the results confirm or contradict the predictions of the models (your hypotheses), as shown by the full table summary written by
the author:
Table summary: Contrary to what I hypothesized, longer time horizons are associated with lower values on the API policy and planning score, meaning less
AIDS intervention (API Policy). As predicted, longer time horizons are associated
with higher government expenditures on health (Health Spending). However,
no inferences can be made with this data about the role of executive time horizons on domestic spending for AIDS programs (AIDS Spending).

The extracts I underlined relate to the author’s predictions. Her summary not only describes how her hypotheses fared against the data, but also covers the parts of her research that did not yield any significant results. Note that the table summary interprets parts of the table that are not reproduced here.

13.5. Graphs
Section 9.1 describes how to add options to your graphs in order to make them more
than a well-formatted table.
Exported graphs should use either PNG or PDF format. Export can be performed in
more than one way, so read carefully:
− The simplest way to produce a graph in Stata is to run a graph command and then save the contents of the ‘Graph’ window by using the Save item of the File menu. Copying and pasting the graph is not good enough, as it does not create the actual graph file and skips the part where you can choose its format.
This method is alright if your graph command appears in your do-file, and if the graph is stored in Stata memory along the way. To do so, add the name() option with the replace sub-option to the graph commands in your do-file that produce a graph included in your paper:
* Histogram of the dependent variable, with saving options.
hist dv, freq normal name(histogram_dv, replace)

− The safest way to export graphs is to include the graph export command instead of copying and pasting, so as to create graphs on the fly when your do-file is running. The example below illustrates the name(, replace) and graph export commands:
* Histogram of the dependent variable, with saving options.
hist dv, freq normal name(histogram_dv, replace)
* Exporting histogram_dv.
gr export histogram_dv.png, replace

Basic formatting rules then apply:
− Be parsimonious. Your do-file will produce more graphs than you will end up including in your final paper: most of these graphs are used for visual exploration, but need not be included in the final stage of analysis. Section 16.2 restates that point.
− Explain your graph. Do not consider your job done until the graph has a title, possibly a caption (as with the footnotes in a table), and a clear reference point in your text that cites your graph as either “Figure 1” or “Fig. 1” (use either, consistently) and explains your graphical results.

14. Assignment No. 1
Section almost entirely consistent with the course in its current form. Use the example and template files as well as the course slides for more exhaustive instructions.
Please make sure that you read this instruction sheet in full. The grade for this Assignment will take every instruction into account to assess your capacity to work in a quantitative environment, including both stylistic and substantive issues. Also read Section 13 on formatting your work before starting this Assignment.
Why work hard, if at all, on your Assignment?
− Take it as a challenge. Unlike jazz music, quantitative data are very much inert by nature and will require that you put some life into them. Explore, tabulate, describe it… Take possession of your data!
− Something important is happening. Quantitative data are currently affecting both our vision of society and the core tenets of social science. Join the scientific revolution!
− Your work is cumulative. You will be able to use this Assignment to write up your final paper, which underlines the absolute need for regular work and practice during this course. Believe me: other methods will fail.

Welcome to the world of quantitative methods, and good luck!

14.1. Research design
Start by describing your research question and hypotheses, using approximately 15
lines of text. While this step would require consulting the scientific literature on your
research topic as to derive your hypotheses (or predictions) from the findings of previous studies, you do not have to produce a literature review for this course, and should
Here are a few examples of potential topics, in addition to those sent by email or mentioned in class:
− Differences across time and/or countries in attitudes toward: religion, inequality, homosexuality, marriage, immigration… The European Values Survey and the World Values Survey hold a large sample of questions on these themes and on many others.
Use your general knowledge and curiosity to come up with questions. Who are the individuals who declare being optimistic about their future? How do income and health influence other aspects of wellbeing? Is public opinion split on topics such as climate change or euthanasia?

− Changes in political factors such as party identification and left/right political cleavages, political regimes, voting systems… The political science literature holds virtually thousands of ideas in that domain. Start by checking the ICPSR data repository on such topics.
Remember that polling data is produced for many political events and policies. Who supported the invasion of Iraq, and who would support military intervention in Iran to stop nuclear proliferation? Is there public support for the use of torture when interrogating terrorists?

− Social determinants of income inequality, poverty and crime rates, health, educational attainment and so on, at a geographic scale for which data are available. The European Social Survey documents some of these aspects for Europe, but there are thousands of datasets available.
As an example, think of how some issues are unequally distributed among age groups and between men and women, such as drug abuse, career advancement, homicide, geographic mobility, traffic injuries, unauthorised digital file-sharing, depression, alcohol consumption, etc.

− Observed effects, both positive and negative, of entry in the European Union, or transition to democracy, or… on demographics, life expectancy, economic performance, crime rates, green technology, social mobility… Evaluations and experiments are conducted on all sorts of events.
To take a few examples, fascinating academic studies have documented the effects of the Cultural Revolution on mental stress and cancer rates in China, the effects of class size on educational performance, or the benefits and costs of public-private partnerships.

− Be imaginative! For example, Jane Austen wrote in Pride and Prejudice (1813): “It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.” To what extent was Jane Austen right? (Thomas Hobbes is also a great source of research questions, but they tend to be much more pessimistic about human nature than Jane Austen was.)

Again, be imaginative: your personal interest and ideas are crucial to the task. Data are
collected and made available at all levels of society, from municipal districts in urban
studies to the whole planet in international relations, on all sorts of topics. Try to review
a maximum of options, but know your limits: some (interesting) questions are out of
our range of skills for this course. Do not use time series data, and select a continuous
or ordinal dependent variable.
A real-life example of model description goes like this. Take a close look at the scientific
style of the description, especially when it comes to the description of predictions and
variables, since you will be borrowing from that style of writing in your own Assignment:

The dependent variable in the model is the size of government. The concept is
measured, following the example of several previous studies (e.g., Alesina &
Perotti, 1999; Poterba & von Hagen, 1999), by total government outlays as a
percentage of GDP. The measure has considerable face value. However, to test
the robustness of the findings, a second model will be estimated using total revenue of all levels of government as a percentage of GDP (Cameron, 1978; Huber et al., 1993) as the dependent variable.
… However, there are also several control variables that need to be included in
the study. It has been argued that the size of the public economy of a country is
determined by its economic openness (see Alesina & Wacziarg, 1998; Cameron,
1978; Rogowski, 1987). A similar statement has also been made in reference to
the size of the welfare state (Katzenstein, 1985). The logic behind the argument
is the following: If more open countries are more vulnerable to exogenous
shocks such as shifts in their terms of trade with world markets and if government spending is capable of stabilizing income and consumption, then more
open countries will need a larger government to play a stabilizing role. The economic openness control variable is measured by the ratio of imports and exports
to the GDP. Institutional models of the size of the public economy have also
stressed the impact of the federal, institutional structure of government (Cameron, 1978; Schmidt, 1996)… Therefore, one might expect nations with a federal structure of government to have larger public economies than countries
with a unitary structure. Linked to this explanation is another aspect of the institutional structure of government: the degree of fiscal decentralization (Cameron, 1978). In accord with the previous statement, the relatively decentralized
nations should have a larger scope of public economy… In any event, despite
the confusion about the direction of the association, federalism has been identified as an important explanatory variable for government size. Lijphart (1999)
created an index measure of federalism and fiscal decentralization ranging from
1 to 5. This measure will be included in the analysis as a control for the effects
of government structure.
Source: Margit Tavits, “The Size of Government in Majoritarian and Consensus Democracies”, Comparative Political Studies 37(3): 340–359, 2004 (footnotes removed).
You should have noticed that hypotheses are often directional predictions, written either as positive relationships, such as: “if x increases, y is also likely to increase, and conversely”, or as negative relationships, as in: “x and y are expected to be inversely related, with y likely to decrease when x increases, and conversely”. The course slides hold advice on hypothesis writing.


14.2. Dataset description
Add to your Assignment a description of the study and dataset that you will be using to
address your research question. For example, the sampling strategy and variable description for the NHIS data that we use to inspect the Body Mass Index of U.S. residents would go as such:
The National Health Interview Survey (NHIS) is a multipurpose health survey
conducted in the United States by the National Center for Health Statistics,
which is part of the Centers for Disease Control and Prevention (CDC).1 It uses
a multi-stage probability sample that includes stratification, clustering and oversampling of racial/ethnic minority groups; it forms a representative sample of civilian, non-institutionalized populations living in the United States, and its total
sample population is composed of 251,589 individuals from the 2000–2009
survey years, which was reduced to N = 24,291 by subsetting the data from the
2009 survey year.
The dependent variable will be the Body Mass Index (BMI) of the respondents,
which we constructed from available measures for weight and height (variable
bmi). We also recoded the BMI variable into its seven official categories, which
range from severely underweight to morbidly obese (variable bmi7).
The independent variables are sex, age, education level, health status, physical
exercise (vigorous and leisurely activity), race and health insurance status. We
expect to find higher average levels of BMI for males, for older and less educated people, as well as for racial groups that are either less educated or less likely
to be covered by health insurance, which is why we included race as a control
variable. We expect BMI to decrease with wealth, education and frequent physical activity.
1. Source URL: http://www.cdc.gov/nchs/nhis.htm.

As mentioned in class, data discovery is a skill in itself, and a time-consuming task
moreover. Identifying and fully describing a dataset is never a question of a few
minutes: you will need at least a couple of hours to locate, download and explore a selection of datasets in order to make your final choice.
If you have absolutely no previous experience with quantitative analysis, you should begin with the European Social Survey (ESS) if you are interested in measuring social attitudes. If you are more interested in country-level data on political and economic topics, the recommended dataset is the Quality of Government (QOG) dataset.
Include some critical perspective on the data. Your data are limited in precision, due to
issues of measurement: for instance, gross domestic product (GDP) is a measurement
(or proxy) of national wealth that does not reflect inequalities in income concentration.

Briefly document any issues with variables and variable measurement in a few lines, after reading from the codebook.
Check how your dataset is constructed. Typically, your dataset should hold homogeneous units of observation in rows and variables in columns. It should also contain only one year of cross-sectional data. Use the subsetting procedures described in Section 7 if your dataset contains data over several years.
Your description should cover the sampling strategy used by the survey, its unit of
analysis and the size of the sample (its total number of observations). Additionally, you
should provide the full source for your data (authors, URL…). All in all, your dataset
description should not range over 10 lines of text. Writing these lines will require that
you spend some time reading from the documentation that comes with your dataset.
A real-life example of dataset description goes like this. Again, take a close look at the
scientific style of the description:
To investigate the sources of ethnic identification in Africa, we employ data collected in rounds 1, 1.5, and 2 of the Afrobarometer, a multicountry survey project that employs standardized questionnaires to probe citizens’ attitudes in new
African democracies. The surveys we employ were administered between 1999
and 2004. Nationally representative samples were drawn through a multistage
stratified, clustered sampling procedure, with sample sizes sufficient to yield a
margin of sampling error of ±3 percentage points at the 95% confidence level.
Our data consist of 35,505 responses from 22 separate survey rounds conducted in 10 countries: Botswana, Malawi, Mali, Namibia, Nigeria, South Africa,
Tanzania, Uganda, Zambia, and Zimbabwe.
Source: Ben Eiffert, Edward Miguel and Daniel N. Posner, “Political Competition
and Ethnic Identification in Africa”, American Journal of Political Science 54(2):
494–510, 2010 (footnotes removed, highlighted text added).

14.3. Variable description
Before starting your variable description, make sure that your dataset holds at least 30 valid (non-missing) observations for your independent and dependent variables; otherwise, you will need to identify other variables to run a statistically robust analysis. Use the su and fre commands to run these checks.
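A minimal sketch of these checks, using hypothetical variable names dv and iv for your dependent and independent variables:

* Valid (non-missing) observations and summary statistics for a continuous variable.
su dv

* Frequencies and missing values for a categorical variable
* (fre is a user-written command; install it once with: ssc install fre).
fre iv

The su output reports the number of valid observations in its Obs column; make sure it is at least 30 for each variable you plan to use.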
Start by describing your dependent variable (the variable that your research will aim at
explaining) in detail. Your dependent variable must be either continuous or ordinal (or
binary if you can handle a bit more theory in later sessions). Check with the codebook
of the study for the exact wording of the question if it comes from a social survey, or
check for the definition of the indicator if it comes from a country-level study. If the
variable is measured on an ordinal scale, describe the range of possible values.

Always check the precise coding of your variables. For example, if you have selected
an ordinal variable that is coded in reverse scores, such as 1 for “Best” and 5 for
“Worst”, you can install and use the revrs command to reverse it into an intuitive scale.
Also make sure that you know how missing values are coded for each of your variables (the native Stata coding is “.”).
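For instance, assuming a hypothetical ordinal variable rated coded 1 for “Best” and 5 for “Worst”, the reversal could be sketched as follows:

* Install the user-written revrs command (one-time step).
ssc install revrs

* Reverse the coding of the scale; by default, revrs saves the result
* in a new variable (see its help file for the naming convention).
revrs rated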
A real-life example of dependent variable description goes like this. Again, take a close
look at the scientific style of the description:
The main dependent variable we employ comes from a standard question designed to gauge the salience for respondents of different group identifications.
The question wording [is] as follows:
“We have spoken to many [people in this country, country X] and they
have all described themselves in different ways. Some people describe
themselves in terms of their language, religion, race, and others describe
themselves in economic terms, such as working class, middle class, or a
farmer. Besides being [a citizen of X], which specific group do you feel
you belong to first and foremost?”
As noted, a major advantage of the way this question was constructed is that it
allows multiple answers and thus permits us to isolate the factors that are associated with attachments to different dimensions of social identity. We group respondents’ answers into five categories: ethnic, religion, class/occupation, gender, and “other.”
Note: in this example (taken from the same source as above), the dependent variable is a nominal variable, which is strictly categorical and not continuous. For this course, you should not use nominal variables, but rather focus on continuous or ordinal variables (which we will treat as continuous).
Continue the description of your continuous dependent variable by describing its distribution and, if necessary, transforming it towards a more normal distribution, using the procedures shown in class and covered in Section 9.
Finish by briefly describing your independent variables, using tables of summary statistics (see Section 9). No graphs should be required for these variables.
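As a short sketch, assuming hypothetical independent variables iv1 and iv2, a compact table of summary statistics could be produced with:

* Summary statistics for the independent variables, one row per variable.
tabstat iv1 iv2, s(n mean sd min max)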

14.4. Programming
Your first Assignment must come with a do-file. You will have to write the do-file and
then run it (execute it) to produce a log file, as described in Section 3 and Section 3.6 in
particular.

Your do-file should not contain any errors: it should run until its end without stopping
(breaking) because of a mistake in your code. This will require that you test your do-file
multiple times to debug it (correct mistakes).
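A minimal do-file skeleton, assuming a hypothetical dataset file mydata.dta and log file name assignment1.log:

* Record the session to a log file.
log using assignment1.log, replace

* Load the dataset.
use mydata.dta, clear

* (Your analysis commands go here.)

* Close the log file.
log close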

14.5. Reminders
− Do not panic. Work regularly, and you should be fine. Some aspects of statistical analysis, quantitative methods or Stata procedures will probably leave you lost at some point, especially if you are learning about these topics for the first time. If you work every week along the basic course schedule described in Section 1, you will be fine.

− Always read the replication sets from each course session, as provided on the course website. The course do-files show you every single procedure that you might need to use for your own research, which means that the answers to your problems are probably just a few clicks away from where you are standing right now.

− Get the formats right. Your email should contain all your files, and should be sent following the instructions mentioned in Section 13. Following these instructions really is a standalone skill. Basic psychology teaches that grading instructors like it a lot when students follow all instructions, and the reverse statement is likely to be true as well.

Again, good luck, and see you soon!


15. Assignment No. 2
Section almost entirely consistent with the course in its current form. Use the example
and template files as well as the course slides for more exhaustive instructions.
Please make sure that you read this instruction sheet in full. The grade for this Assignment will refer to every instruction in order to assess your research skills in a quantitative environment, including both stylistic and substantive issues. Also read Section 13
on formatting your work before starting this Assignment.

15.1. Corrections
Assignment No. 1 was composed of a text file and a do-file, and was the first draft that you submitted towards your final paper. Assignment No. 2 follows the same logic: assignments in this course are not standalone test grades but cumulative writings. Assignment No. 2 builds on your previous Assignment and will bring you just one step away from writing up your final paper. It is therefore essential that you fully revise Assignment No. 1 before ‘upgrading’ it to Assignment No. 2. Please refer to Section 14 and to the feedback on your Assignment to make sure that you have done so.
Here are some common mistakes that often appear in early do-files, and which you
should immediately correct:
− Analysing ESS data without survey weights. If your research design relies on data from the European Social Survey, you will need to weight the data as indicated in its documentation.
Simply put, you should insert the svyset [pw=wgt] command immediately after loading your data with the use command. If your research design covers all European countries, then you need to generate a product of design and population weights to weight respondents properly: insert gen wgt=dweight*pweight before the svyset command.
If you are working on only one country, or if population (country) weights are irrelevant to your research design, simply replace wgt by dweight in the svyset command above and ignore the gen command.
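Put together, and assuming a hypothetical ESS data file named ess2008.dta, the weighting steps above read as follows in a do-file:

* Load the data.
use ess2008.dta, clear

* Combine design and population weights (cross-national designs only).
gen wgt=dweight*pweight

* Declare the survey weights to Stata.
svyset [pw=wgt]

* Single-country designs: use the design weight alone instead.
* svyset [pw=dweight]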

− Using fre for continuous variables when you should be using su instead. The most common mistake at the level of descriptive statistics is to use the fre command when the summarize (su) command is appropriate.

If you are using fre to count valid observations and missing data, you should be
using the codebook command with the c option instead, which also gives you
summary statistics and variable labels, as in this example:
. codebook agea gndr, c

Variable   Obs     Unique   Mean       Min   Max   Label
------------------------------------------------------------------------------
agea       50996   87       47.57369   15    123   Age of respondent, calculated
gndr       51123   2        1.541518   1     2     Gender

Note that in this example, the mean value of the gndr variable is irrelevant, as it
was computed from arbitrary values assigned to gender in this categorical variable.
If you are drowning in more serious problems with missing data than just a few
observations to drop, you should turn back to recoding your missing values, as
explained in Section 8.2.
Finally, do not forget about the count command and the if mi() or if !mi() logical operators. Those often come in handy when you are thinking about selecting variables for crosstabulation or association tests.
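As a quick sketch, assuming a variable named agea, these counts read as follows:

* Count observations with missing values on agea.
count if mi(agea)

* Count valid (non-missing) observations on agea.
count if !mi(agea)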
− Assignment No. 1 should be close to three pages. Assignment No. 2 will add to them, but before starting, you need to check whether your data and variable descriptions are concise enough. Figures and tables might have pushed you slightly over, up to four or even five pages if you chose a large font or have a long table of summary statistics. But there is hardly any reason to push Assignment No. 1 over five pages.
Limit figures and tables in Assignment No. 1 by using a single table for summary statistics, and a single graph for the histogram of your dependent variable: you are not expected to include other graphs like variable transformations
from the gladder command at the level of descriptive statistics. Assignment No.
2 will add more figures and tables.
Copy-pasted Stata output does not count as text. Your text should not include
Stata commands, and your tables should not be Stata output copied and pasted
in your text. Section 13.4 explains how to format summary tables, and the
tabout, tabstat and mat2txt commands shown in the template do-file for Assignment No. 1 will export your results for formatting with Microsoft Office.

− The following paragraphs summarize what your corrected version of Assignment No. 1 should be telling its reader(s):
1. Introduction: (i) one or two paragraphs to state clearly the research question,
explain its relevance to the general public and with respect to previous literature
on the subject; (ii) one paragraph or two to formulate your argument in terms
of clear and testable hypotheses along with an explanation/justification for
what you predict.
2. Data: one paragraph that describes the dataset used in your study, describes
and justifies your choice of countries to compare, and mentions the final sample
size after deletion of missing cases.
3. Variables: (i) one paragraph that describes your dependent variable in terms of its summary statistics (mean, standard deviation, median, minimum, maximum) and its distribution, shown by a histogram; (ii) one or two paragraphs to cite your independent variables and explain their relevance to your research question and hypotheses, along with a description of their distribution, using either proportions (if the variables are categorical: binary, nominal or ordinal) or summary statistics (if the variables are truly or sufficiently pseudo-continuous).
4. Analysis: one paragraph for every separate association between your DV and
each of your IVs. This section is developed in Assignment No. 2.
5. Conclusion: one or two paragraphs that summarize your results with regard
to your general argument and research question.

15.2. Association
In this Assignment, you test for independence between your dependent variable and
each of your independent variables. The objective is to identify which independent variables are worth keeping for the final regression analysis. The rule of thumb is to keep
only variables that have a statistically significant association with your dependent variable.
To that end, you need to review course material on bivariate tests: replicate each session using the do-files, read through the course handbook chapters and slides, and read
Section 10 as well as various parts of Section 11 for correlation and simple linear regression.
As you will see, there are different types of tests of bivariate association: chi-squared
tests, t-tests, correlation and simple linear regression. The choice of the right test depends on the type of the variables. There are three basic cases:
− When both variables are categorical, use the Chi-squared test. Produce a table with column or row percentages, comment on the relationship between the variables, and interpret the statistical significance (p-value) of the Chi-squared statistic.
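A minimal sketch, using hypothetical categorical variables v1 and v2:

* Crosstabulation with column percentages and a Chi-squared test.
tab v1 v2, col chi2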

− When one variable is continuous and the other is categorical, use the t-test. To that purpose, the categorical variable needs to be recoded into a binary (0/1) variable in order to compare the mean value of the continuous variable in each group. Comment on the differences in means by interpreting the p-value for the null and each directional hypothesis.
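For instance, assuming a continuous variable bmi and a hypothetical binary (0/1) variable female:

* Compare the mean of the continuous variable across the two groups.
ttest bmi, by(female)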

− When both variables are continuous, use a simple linear regression. After looking at the number of observations used in the model, interpret the F-statistic, its p-value and the R-squared. This first step establishes how much statistical power and fit your model provides.
Once you understand the overall fit of your model, turn to the independent variable: identify significant coefficients by looking at standard errors, t-values and p-values, and read their direction and magnitude, as covered in Section 11.

Note the following special cases:
− The Chi-squared test requires minimum cell counts: if your crosstabulation shows a table where some cells fall below 5 observations, you should use Fisher’s exact test instead, as you should with ‘2 x 2’ contingency tables, regardless of the cell counts. Fisher’s exact test will be computationally more correct in both cases. Its test statistic reads as if it were a p-value.

− Interval or ordinal categories offer you a choice of strategies: you can treat them either as categorical or as continuous data. However, if you decide to treat them as continuous and wish to measure the association of that variable with another continuous variable, you first need to make sure that there is a linear relationship between the two variables. To check this, display the relationship using a scatterplot. If it shows an approximately linear relationship, you can then use a simple linear regression.
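A minimal check, assuming hypothetical continuous variables y and x:

* Inspect the shape of the relationship before fitting anything.
sc y x

* If the cloud of points looks roughly linear, run the regression.
reg y x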

Interpret, usually in two or three sentences at most, each of the tables where you detected a statistically significant relationship between two variables. Report relevant statistics in brackets within your sentences, such as the p-value for a Chi-squared test,
Fisher’s exact statistic (which reads as a p-value itself), or the p-value of a t-test or a
proportions test. When reporting correlations, report Pearson’s r and its p-value; do not
forget to report the intensity and direction of the correlation. If you are dealing with
several statistically significant correlations within your choice of variables, use a correlation matrix to present them.
In your do-file, you have to test all your independent variables, but you do not have to produce tables or graphs in your Assignment for cases where the null hypothesis was retained (when no association can be identified at your alpha level of significance). Likewise, you should be selective when testing for interactions between independent variables: produce these tests only if there is a substantive justification to do so, as when you control for age while testing the association between an independent (explanatory) variable like household income and a dependent variable like the number of children in the household.
Remember to discuss the statistical significance and substantive importance of all bivariate associations. If you are comparing the behaviour of your DV across countries, regions or socio-demographic groups, then you should focus here on discussing differences across those groups. Use as many separate tables as needed for crosstabulation and, if useful, use figures to illustrate your results.

15.3. Regression
Assignment No. 2 also covers simple linear regression, which comes as a natural complement to correlation. Section 11 covers both simple and multiple linear regression, but
you should limit yourself to simple linear regressions with only two variables at play for
this Assignment.
Contrary to the methods we used while covering association and correlation, regression goes beyond merely detecting relationships between variables: it is a modelling
technique that provides estimations of the model parameters. When reporting on the
regression coefficients and intercept, you should provide your own interpretation of
that model, as illustrated in class.
Including a scatterplot showing the linear fit between your dependent and independent variables can serve to display your simple linear regression. The graph should not
be used as mere illustration: it refines the analysis by providing an informative visualization of the relationship between your variables, from which you can extract additional
observations: is the relationship truly linear, or is it curvilinear like an exponential (quadratic) relationship? Does the scatterplot reveal visually identifiable outliers, what observations do they stand for, and can you explain why they deviate so much from the linear fit? (A classical example is Luxembourg, which always stands out as soon as GDP
per capita is involved as an independent or dependent variable, because its residents
are outstandingly wealthy in comparison to virtually any other country.)
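Such a graph can be sketched as follows, again with hypothetical variables y and x:

* Scatterplot of the observations, overlaid with the linear fit.
tw (sc y x) (lfit y x)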

15.4. Reminders
Here are a few reminders as to how you should be organising your work. Most of it will
sound like old news to many of you.
− Replicate, replicate, replicate the course sessions. To minimize the risk of using the wrong command, test or graph, the easiest thing is to replicate the last course session every week. The course website provides every single file used in class, so that you can run the do-files again.

− Your do-file should be replicable. When grading, we must be able to run your analysis again, by running (executing) the do-file. This implies that you send a do-file that contains no errors, along with your original dataset, just as the course website provides both so that you can replicate course sessions at home.

− Write as if you are writing a research paper. Your Assignment is the draft for your final paper: it should be written as an ‘advance copy’ of it, with correct English, full sentences, and clear explanations about what you are doing, what you are finding, and what you think about it.
If you follow these instructions, bits and pieces of your paper should read as this completely fictional example, which illustrates how to translate a battery of tests into tables,
figures and, most importantly, interpretations:
The association between income and age groups, which reported a statistically
significant relationship with the Chi-squared test (p < 0.01), was also observable
when using continuous measures of both variables. Specifically, the correlation
matrix (Table 1) reports a strong, positive correlation (r = 0.45, p < 0.05) that
confirms their interplay within the sample population. Furthermore, as shown in
Figure 1, their relationship is quasi-linear, except at the highest values of both,
where the linear fit is slightly less truthful to the data. The regression coefficient
can be understood as follows: from the age of 25 onwards (the minimum age of the respondents in our sample), income, starting at approximately $22,000 per year, increases quasi-linearly by $700 each year on average, which reflects the effect of career advancement and enhanced employment opportunities on wages, as well as other factors that might have to do with capital accumulation.
When you are done with this Assignment, the last step towards your final paper will
consist in building a multiple regression model and a final interpretation of your research question, all in the form of a standard research paper. Good luck, and see you
soon!


16. Final paper
Section still in draft form, but a good restructuring of all assignment sections will help
to make clear what should be mentioned here.
Your final paper is the finish line, the ultimate point, the very last episode of that epic quest of yours. Fortunately, there is no dragon to beat. However, there is a paper to write: if your last draft does not already read like a draft paper, you might have several rounds of revision ahead of you.
In many ways, you have already cut many heads off the Stata hydra: by finding and preparing your dataset, by producing descriptive statistics, by running association tests and by thinking, again and again, about your research design. Finally, the one last step that has definitely revealed the worth of your work has consisted in running a linear regression model that assessed the predictive value of your independent variables.
Your final paper is primarily a reorganization of your work, which means that you will, again, be revising previous assignments in order to suppress any potential ambiguity in your wording or ideas, add what you might have omitted on first submission, and correct any mistakes that were flagged so far.
Rewriting will represent roughly 75% of your work on your final paper (increase that estimate if your previous assignments came back with a grade below 4 and/or lots of instructions for revision). The last 25% consists in checking your do-file carefully, reorganising it, producing a log file and sending them all by email, as with your drafts.

16.1. Structure
Your paper follows a scientific style of writing as well as a scientific breakdown of sections. Read this section even if you have some experience with writing under these conventions; if you do not, read it with extra care, as it will prove useful not only for this course but also for many others.
First, your paper needs to be written in scientific style, which means that the writing will be as simple as possible and only as complex as necessary. Some additional guidelines apply:
− Because the paper reflects what you know and did, do not use any term or argument that you cannot explain yourself. For instance, your paper will not reference “a sensitivity analysis of cluster sampling over high-resolution data”.

− Without exception, reference every single item of your paper that you did not create yourself. This applies to arguments, observations, and also to data: your dataset needs to be fully referenced. The online source of your dataset will usually give an example citation for it.

− Just as with any academic work, your paper is expected to have been carefully proofread, up to the point where the reader should not detect more than one occasional spelling mistake on every page or so. Spell-check your paper and check your sources for names, acronyms, etc.

The standard structure of a research paper (Introduction, Data and Methods, Results, Discussion) can be adapted to this course as follows (note that the example extracts are fictional and do not represent any real study; real examples are provided later in this instruction sheet).
− Your Introduction spells out your research question, outlines the variables of interest, and offers your hypotheses (from Assignment No. 1).
e.g. “I study the relationship between extreme-right voting and socioeconomic status (SES), as measured through occupation, income and educational attainment. I also control for age, gender, ethnic origin and religious beliefs.
My hypothesis states that extreme-right voters sit at the bottom of social hierarchies within their age and gender groups, and will therefore score lower on all measurements than other members of the social categories to which they belong.”

− Your Data and Methods section describes your dataset and your method of analysis, which is ordinary least squares, or OLS, multiple linear regression.
e.g. “The study uses the last edition of the British Election Study, a survey conducted by […] in May 2010. The dataset, which is available at […], contains […] adult respondents. The data were collected through face-to-face interviews and the method of sampling was […].
The data were searched for significant correlations, which were then explored through simple and multiple regression analysis in order to identify linear relationships between the variables of interest.”

− Your Results report your independence tests (from Assignment No. 2) and the results of your linear regression model.
e.g. “Extreme-right voting does not concern a majority of the population: as Figure 1 shows, only a small fraction of British voters declared voting for any of the extreme right political parties.
After observing a significant correlation between […] and […], we can state with confidence that extreme-right voting is higher in lower income groups. Figure 2 below plots this relationship for each gender group.
In parallel, our regression of political participation against income also shows that lower income groups participate significantly less in elections. The results of that regression are reported in Table 1. The high R-squared (.56) suggests that income is a major factor at play here.”
− Your Discussion concludes on your project, and includes criticism of both your data and your results.
e.g. “Although the project succeeded at showing that income and educational attainment are predictors of extreme-right voting, the weak association suggests that other important factors come into play in explaining this relationship.
Furthermore, our hypothesis that religious behaviour would have a significant impact on voting was not confirmed by our analysis. The small sample size for our independent variables measuring religiosity limited the significance of our tests and constitutes an important drawback of our study.”

16.2. Limits
− Paper. Your research paper should fit on a maximum of 10–12 pages, using the standard format defined during our last course session. There is no length limit for the number of lines of code in your do-file, but anything outside of the 100–400 range will probably indicate something strange.

− Graphs. Include only relevant graphs that help to understand the relationships mentioned in your text. Choose the type of graph carefully, as explained in class and in several sections of this guide.
The feedback on Assignment No. 2 will usually include some notes on which graphs to include, but the simplest way to know whether or not to include a graph in your final paper is to judge whether it brings anything of value to the rest of your analysis; if the answer is anything but ‘absolutely yes’, do not include the graph.

− Tables. Do not include tables except for your most significant outputs: summary statistics, correlation matrix, and regression models. For other significance tests, report the results (and especially the p-value) directly in your text. Presenting and exporting tables is covered in Section 13.4.

16.3. Example
The following extracts are taken from a recent working paper published by the United
Nations Development Programme, which is used to illustrate what a research paper using quantitative data and methods should contain.
These are the opening lines of the text:
Introduction1
This paper examines the variation across countries and evolution over time of
life expectancy.

The opening section examines the impact of national income, measured as GDP
per capita in PPP, in Preston and augmented Preston regressions. Rather than
focus only on recent cross-sections since 1970 or so we use the available historical data going back to the beginning of the 20th century (the data are taken
from the series created for the GAPMINDER application and are described in the
data appendix). This long-run focus allows us to establish several basic facts
…
1 Source: Lant Pritchett and Martina Viarengo, “Explaining the Cross-National Time Series Variation in Life Expectancy: Income, Women’s Education, Shifts, and What Else?”, UNDP Research Paper 31, October 2010.
The authors immediately inform the reader about the dependent variable, life expectancy, and then introduce the first independent variable, income, and its proxy (the means of measurement for it), which here is GDP per capita in PPP.
The method of analysis – some form of regression – is also mentioned, and the data are
described by providing the source and the time period covered. It is good practice to
store a detailed description of the data sources in an appendix, which the authors do
(see pages 60–63 of their paper).
The first footnote acknowledges some colleagues for their help with writing the paper:
if you have received help from anyone else than the course instructors, including other
students from the class or people who you might have emailed about accessing your
dataset, you should acknowledge their help.
The text continues as such:
First, there has been a strong cross-national relationship between income and life expectancy for as far back as one can take the data. In the simple double natural log Preston curve (life expectancy regressed on GDP per capita) the R-squared for the 21 countries with data was as high as .8 as early as 1927 and was at that level through the pre-World War II period. The modern data sets with over 150 countries begin in 1952 and have availability every five years and in that data there has been a high and rising R-squared roughly ever since (once one controls for the AIDS affected countries).
This paragraph should be almost fully understandable now that you have completed the course. It mentions the sample size, the variables used by the authors in their hypothesis test (life expectancy and GDP per capita), as well as the shape of the distribution revealed by this hypothesis (a natural logarithm).
The extract also shows that important aspects of your analysis should be mentioned directly in the text, like the R-squared or the control variables (here, the authors set aside the units of observation – countries – with high infection rates of HIV/AIDS).

The only part that requires further explanation is the Preston curve, which is a classic
finding by Samuel H. Preston: when regressed onto GDP per capita for cross-sectional
data, life expectancy follows a natural logarithmic distribution. Learn more on Wikipedia: http://en.wikipedia.org/wiki/Preston_curve.

16.4. Reminders
− Normalise your files and emails. Files sent without normalisation at this stage run the unacceptable risk of either delaying the grading process or (even worse) getting lost in dozens of other emails. You have almost all done very well with this, for which you earn infinite gratitude from virtually every grader in the world; please do so one last time.

− Unlike deadlines for midterm assignments, the deadline for the final paper is absolutely firm, as it corresponds to the last days before which grading can be performed in acceptable conditions for formal submission of the grades to the Sciences Po administrative units. Late work will therefore be dismissed.

The deadline will appear in a class email along with additional guidance.
Good luck, and well done!
We wish you the best of luck in all your future endeavours. Please submit some feedback on the course, and let’s meet later on for drinks and/or food.

Datasets:

ESS: European Social Survey (used in Sections 6 and 8), 4th wave (2008), http://ess.nsd.uib.no/

QOG: Quality of Government (used in Sections 9, 10 and 11), most recent update (6 April 2011), http://qog.pol.gu.se/

NHIS: National Health Interview Survey (used in Sections 7, 9 and 10), last survey year (2009), http://www.cdc.gov/nchs/nhis/

Commands: (in progress…)

extremes, 82

mvencode, 63

Applied examples:

Application 5a. Weighting a cross-national survey (Data: ESS 2008), 34
Application 5b. Weighting a multistage probability sample (Data: NHIS 2009), 36
(Data: ESS 2008), 39
Application 6a. Locating and renaming variables (Data: ESS 2008), 43
Application 6b. Counting observations (Data: ESS 2008), 45
Application 6c. Selecting observations (Data: ESS 2008), 45
Application 7a. Subsetting to cross-sectional data (Data: NHIS 2009), 53
Application 8a. Inspecting a categorical variable (Data: NHIS 2009), 57
Application 8b. Inspecting a continuous variable (Data: NHIS 2009), 58
Application 8c. Labelling a dummy variable (Data: NHIS 2009), 58
Application 8d. Recoding continuous data to groups (Data: NHIS 2009), 61
Application 8e. Recoding dummies (Data: NHIS 2009), 61
Application 8f. Recoding bands (Data: NHIS 2009), 63
Application 8g. Encoding strings (Data: MFSS 2006), 66
Example 9a. Visualizing continuous data (Data: NHIS 2009), 72
Example 9b. Kernel density plots (Data: NHIS 2009), 74
Example 9c. Visualizing categorical data (Data: ESS 2008), 74
Example 9d. Survey weights and confidence intervals (Data: ESS 2008), 76
Example 9e. Democratic satisfaction (Data: ESS 2008), 78
Example 9f. Normality of the Body Mass Index (Data: NHIS 2009), 82
Example 9g. Transforming the Body Mass Index (Data: NHIS 2009), 83
Example 9h. Inspecting outliers (Data: NHIS 2009), 85
Example 9i. Keeping or removing outliers (Data: QOG 2011), 85
Example 10a. Trust in the European Parliament, 89
Example 10b. Female leaders and political regimes, 91
Example 10c. Religiosity and military spending, 93
Example 10d. Religion and interest in politics, 95
Example 10e. Legal systems and judicial independence, 97
Example 10f. Party support (with controls), 105
Example 11a. Foreign aid and corruption, 110
Example 11b. Trust in institutions, 112
Example 11c. Subjective happiness, 115


This version: 23 February 2012.
Written using Stata 11/12 SE on Mac OS X Lion.
Typeset in Linotype Syntax and Menlo.


I rest my head on 115
But miracles only happen on 34th, so I guess life is mean
And death is the median
And purgatory is the mode that we settle in

– Cannibal Ox, “Iron Galaxy”

```
