Stata Guide 2012

Acknowledgements go first and foremost to Ivaylo Petev, with whom I co-taught three of the
five courses for which this guide was written. I also benefitted from a lot of friendly advice from
Baptiste Coulmont, Emiliano Grossman, Sarah McLaughlin, Vincent Tiberj and Hyungsoo Woo.
All mistakes and omissions, as well as the views expressed, are mine and mine alone.
Statistics with Stata
Student guide
Version 0.9.8.4, by François Briatte
Contents
Introduction 2
1. Basics 3
2. Computers 8
3. Stata 16
4. Research 21
Data 30
5. Structure 31
6. Exploration 42
7. Datasets 49
8. Variables 56
Analysis 70
9. Distributions 71
10. Association 87
11. Regression 108
12. Cheat sheet 118
Projects 124
13. Formatting 125
14. Assignment No. 1 133
15. Assignment No. 2 140
16. Final paper 146
Draft version, check for updates!
Introduction
This guide was written for a set of five quasi-identical postgraduate courses run at Sci-
ences Po in Paris from Fall 2010 to Spring 2012. The full course material appears online
at this address: http://f.briatte.org/teaching/quanti/.
The course is organised around three learning objectives:
First, it introduces some essential aspects of statistics, ranging from describing varia-
bles to running a multiple regression. The course requires reading statistical theory ap-
plied to social surveys as a preliminary to all course sessions.
Second, it introduces how to operate those procedures with Stata, a statistical software
that we will practice during class. The course also requires that you practice using Stata
outside class in order to become sufficiently familiar with it.
Third, the course will lead you to develop a small research project, on which your
grade for the course will be based. The course therefore requires regular attendance
and homework, which will lead to writing up that research project.
This guide covers the following topics:
The course basics, Stata fundamentals and essential computer skills
Basic operations in data preparation and management (Part 1)
Introductory quantitative methods and statistical analysis (Part 2)
Instructions for the assignments and the final paper (Part 3)
1. Basics
Quantitative methods designate a specific branch of social science methodology, within
which statistical procedures are applied to quantitative data in order to produce inter-
pretations of complex, recurrent phenomena.
Just as in other domains of scientific inquiry, the complexity and precision of statistical
procedures are necessary requirements to the study of some large-scale phenomena by
social scientists. Recent examples of such topics include the evaluation of a program
aimed at developing fertilizer use in Kenya (Duflo, Kremer and Robinson, NBER Work-
ing Paper, 2009), an explanation of attitudes towards highly skilled and low-skilled im-
migration in the United States (Hainmueller and Hiscox, American Political Science Re-
view, 2010), and a retrospective electoral analysis of the vote that put Adolf Hitler into
power in interwar Germany (King et al., Journal of Economic History, 2008).
Quantitative methods courses come with a particular set of principles, which might be
arbitrarily summarized as such:
Researchers learn and share their knowledge of quantitative methods to the
largest possible audience, and to the best of their abilities.
Quantitative data are shared publicly, along with all necessary resources to rep-
licate their analysis (such as do-files when using Stata).
On the learning side, some very simple principles apply:
Quantitative methods are accessible to everyone interested in learning how to
use them. Curiosity comes first.
There is no learning substitute to reading, practicing and looking for help, from
all kinds of sources. Reading comes first.
Making mistakes, correcting one’s own errors and hitting one’s own limits are
intrinsic to learning. Trial-and-error comes first.
Statistical reasoning and quantitative methods are intellectually challenging for teachers
and students alike, and a collective effort is required for the course to work out:
You will have to attend all course sessions: your instructors expect systematic
attendance; catch up with the sessions that you have missed.
When attending classes, attend classes: your instructors will feel completely
useless if you do anything else, like reading your email or browsing whatever.
Assignments will be graded in order to monitor your progress: no assignments,
no progress towards your final research project, no grades.
Read all course material: read everything you are told to if you want to under-
stand what you are learning in this course.
These, of course, are similar to the course requirements for your other classes.
1.1. Homework
Apart from attending the weekly two-hour course sessions, you are required to:
Complete readings from the handbook and other material, as indicated in the
course syllabus. This will take you approximately one hour per week, perhaps
two if statistics are completely new to you.
Replicate course sessions outside class, using the do-files provided on the
course website. This will take you between half an hour and one hour per week,
depending on your learning curve.
Work on your research project and assignments, using the instructions provid-
ed during class and in the course documentation. Your project will require be-
tween one hour and a half to three hours of work per week, depending on your
learning curve and on the project itself.
In total, the time of study for this course amounts to two hours of class and between
three and six hours of homework per week. The exact figure varies, but a fair estimate is
that you will spend between five and eight hours per week studying for this course.
It is important to state right from the start that it is not possible to follow the course
irregularly, either by skipping weeks and trying to catch up later, or by allocating
long periods of last-minute work before deadlines. Experience shows that these strate-
gies systematically lead to low achievement and grades.
1.2. Assignments
The course was conceived with a hands-on focus. This means that you are not ex-
pected to take lecture notes and then revise for a final graded examination. Instead,
you will develop a research project throughout the course.
Your work will be corrected, commented and assessed twice during the course,
through assignments that will help monitor your progress. The final version of your pro-
ject will make up the largest part of your grade.
In practice, your assignments are open assignments that you can complete with all
the resources you need at hand (notes, guides, online help). This cancels out a strategic
skill often observed among students: memorising large amounts of information for just
one occasional exam. Memorizing will not work at all for this course. Instead, you will
have to learn and practice regularly throughout the semester. If you are not used to
that method of study, then the course will be one critical opportunity to learn it.
The assignments are also cumulative: Assignment No. 1 will be revised and included in
Assignment No. 2, just as your final paper will draw extensively on the revised versions
of both assignments. To learn more on how to complete assignments, read the instruc-
tions provided in Part 3 of this guide.
1.3. Communication
Let’s immediately clarify two things about student-instructor communication for this
particular course:
Email will be used for feedback and all correspondence. To simplify this pro-
cess, we will use normalized email subjects.
A normalized email subject looks like SRQM: Assignment No. 1, Briatte and
Petev, where SRQM is an acronym for the course, and Briatte and Petev
are the family names for you and your study partner.
Normalized email subjects apply to all correspondence, not just when sending
assignments. To ask a question on recoding, you should hence use “SRQM:
Question on recoding”.
When working in pairs, always copy your partner when sending emails and as-
signments. Likewise, send all your emails to both course instructors if you
want a reply (and especially a quick one).
You should ask questions in class and email additional ones, after having
made sure that the answer is not already included in the course material. That
implies that you take ownership of that material, rather than passively absorb it
as lecture notes.
You should not feel uncomfortable asking questions in class. Neither should
you expect others to ask questions for you. Unfortunately, you might have sur-
vived several courses doing precisely that until now, and might survive more in
the future doing the same. This course, however, runs on personalised projects
that do not allow exit or free riding.
The extra effort that is required from you on that side is the counterpart to of-
fering a course where you are learning through practice rather than by rehash-
ing a handbook into a standardized examination or abstract problem sets with
no empirical counterpart.
Your instructors receive multiple emails from several classes; normalization really helps
in sorting out large volumes of email. There are no direct sanctions for not following
this principle, but there are indirect costs, especially if your assignment emails get lost in
the instructors’ grading pile or if you end up waiting three weeks to send a question
that will put you behind schedule on your project.
1.4. Research project
The course is built around your elaboration of a small-scale research project. Because
the course is introductory by nature, several limits apply:
You are required to use pre-existing data, instead of assembling your own data
and building your own dataset, which is a much longer process that requires ad-
ditional skills in data manipulation.
You are required to use cross-sectional data, because time series, panel and
longitudinal data require more complex analytical procedures that are not cov-
ered in introductory courses.
You are required to use continuous data, because discrete variables also use dif-
ferent techniques not covered at length in the course. This applies principally to
your choice of dependent variable.
These requirements and their terminology are covered in Section 5. For now, just re-
member this basic principle: your research will be based on one dependent variable,
sometimes called the ‘response’ variable, and you will try to explain this variable by
predicting the different values that it takes in your sample (dataset) by using several
independent variables, sometimes called the ‘explanatory’ variables, or ‘predictors’, or
‘regressors’ in technical papers.
Because your research is a personal project, you might bend the above rules to some
extent if you quickly show the instructors that you can handle additional work with da-
ta management. The following advice might then apply:
If you are assembling your data by merging data from several datasets, you will
be using the merge command in Stata (Section 5.2). You might also choose to
use Microsoft Excel for quicker data manipulation. Do not assemble data if you
do not already have some experience in that domain.
If you are converting your data, refer to the course and online documentation
on how to import CSV data into Stata using the insheet command, or how to
convert file formats like SAV files for SPSS. Always perform extensive checks to
make sure that your data were properly converted into a readable, valid file.
If you are interested in temporal comparison, such as economic performance
before and after EU accession, you can compute a variable that will capture, for
example, the change in average disposable income over ten years (see the sketch
after this list). Stick with a single variable, and ask for advice in class before proceeding.
If you have selected nominal data as your dependent variable, such as religious
denomination, then something went wrong in your research design, unless you
know about multinomial logit, in which case you should skip this class. Please
identify a different variable that is either continuous or ‘pseudo-continuous’,
i.e. a categorical variable with an ordinal (or better, interval) scale, such as edu-
cational attainment.
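To make the preceding points more concrete, here is a minimal and purely hypothetical
sketch of what such data preparation could look like in a do-file; the file and variable
names are invented for illustration and do not refer to any course dataset:

* Import a CSV file into Stata (hypothetical filename).
insheet using "income.csv", clear
* Compute a single variable capturing a ten-year change (hypothetical variables).
gen income_change = income2010 - income2000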
1.5. Guidance
This guide works only if you use it. Its writing actually started with students’ questions:
several sections were first written as short tutorials concerning specific issues with data
management. One thing led to another, and we ended up with the current document.
The aim is to cover 99% of the course by version 1.0. A handful of students also pro-
vided valuable feedback on the text; thanks!
The Stata Guide is a take on the course requirements: it offers a narrative for the
commands used, by tying them together with core elements of statistical reasoning
and quantitative methods. The guide aims to cover the most useful introductory
concepts of statistics for the social sciences, and to offer a detailed exploration of these
concepts with Stata.
The term ‘introductory’ is important: the course focuses on selected commands and
options to work through selected operations of frequentist statistics. To fully support
these learning steps, it also introduces basic computer and research skills that are often
missing from the training of students, and offers a way to assess all aspects of the course
through a small-scale research project.
Several sections of the guide are still in draft form, so watch for updates and read it
alongside other documentation. As explained in Section 2.5, there is a wealth of documen-
tation out there, and you should be able to locate help files to complement the Stata
Guide on how to use Stata to analyse your data.
2. Computers
A course on quantitative methods is bound to make intensive use of computer soft-
ware. You all use computers routinely for many different activities, but your level of
familiarity with some of the fundamental aspects of computers can vary dramatically.
Being reasonably familiar with computers is required for this class.
Please read this section in full and assess whether you are familiar enough with the
notions covered; otherwise, you should start practicing as soon as possible. A reasona-
ble level of familiarity with computers will help you with using Stata and completing
assignments, and will also generally come in handy.
2.1. Basics
The course requires minimal computer skills. In order to open and save files in Stata,
you should be able to:
Locate files using their file path. In recent, common operating systems, a file
path looks like /Users/fr/Courses/SRQM/Datasets/qog2011.dta in Mac OS X,
or like C:\Users\Ivo\Desktop\SRQM\Replication\week2.log in Windows. Get
used to these if you have never used them before.
Locate online resources using their URL. The URL for the course website is
http://f.briatte.org/teaching/quanti/. We will use URLs extensively when guid-
ing you through coursework and course material.
Understand file and memory size, which is often displayed in megabytes (MB).
Using Stata 11 or below correctly requires setting memory to load large files: for
instance, set mem 500m sets memory to 500MB, as in the short sketch below.
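As a minimal sketch of how these elements fit together, the two lines below could be
typed at the start of a Stata 11 session to allocate memory and then open a dataset by
its full file path; the path shown reuses the Mac OS X example above and is, of course,
hypothetical:

set mem 500m
use "/Users/fr/Courses/SRQM/Datasets/qog2011.dta", clear

The use command itself is covered in Section 3.4.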
2.2. Filenames
Filenames are another essential aspect of computer use, especially when you are han-
dling a large number of files and/or using multiple copies of the same file. Some gen-
eral recommendations apply:
In all cases, filenames should be short and informative. Regularly accessed
files, like datasets, have short filenames for faster manipulation, and contain the
time period covered by the data.
In some cases, filenames require normalization. This implies using sensible file-
names and standard version numbers for files that are chronologically ordered.
This point is important because you will be required to normalize the filenames
for your work files in this class (Part 3).
2.3. Equipment
Regarding computer equipment, you will need:
Access to a computer, both at university and at home. You should bring your
personal laptop to class if you own one. Make sure that you know how to work
with your computer and that it is fast enough.
A university email account subscribed to the course, and possibly a personal
email account to share larger files and to back up your work. The standard solu-
tion for an efficient work mailbox is Gmail.
Access to the ENTG, as provided by Sciences Po. The course will use the ‘Doc-
uments’ pane to share the course emails and readings. Other files will be avail-
able from the course website.
A word processor to type in your final paper. Despite being a worldwide stand-
ard, Microsoft Word is unstable: always back up your work. Any solution is
good as long as it can be printed to a PDF file.
A working copy of Stata (our software of choice, introduced in Section 3) on
the computer(s) used during the course and at home. This point will be dis-
cussed during class. Stata includes a plain text programming editor.
A USB stick, to build a course ‘Teaching Pack’ by saving and organizing all
course material, as well as the files from your research project. Always make
regular backups of your data in at least two different locations.
Some of these items will be provided to you through Sciences Po. Please make sure that
you have equipped yourself as early as possible in the semester, and as indicated sever-
al times in the list above, always back up your work!
2.4. Downloads
The course will regularly require that you locate and download resources online, from
datasets to do-files, as well as other course material, mostly in PDF format or as ZIP ar-
chives. Make sure that you know how to handle these formats.
When downloading files, do not use counter-productive browser settings that down-
load files into temporary folders, or that automatically open files, or even worse, that
add file extensions to your downloads. For instance, if your browser automatically adds
a .txt extension to your do-files, you will need to rename the file by turning its file
extension back to just “.do” to open it in Stata.
Google Chrome, Mozilla Firefox and Apple Safari are common Internet browsers with
appropriate ‘Save As’ options available from their contextual menus. The example be-
low shows the contextual menu for Google Chrome on Mac OS X.
All course material should be archived into a structured folder hierarchy, which will
depend on your own preferences and operating system. A simple hierarchy, such as
~/Documents/SRQM/ on Mac OS X, will let you access all files quickly.
2.5. Help
Quantitative methods cannot be learnt once and for all: the course will require that
you frequently search for help, often from online sources. Always consult the course
material for each session before seeking additional help: the answer is very often just
before your eyes or a few clicks away.
If you are looking for help on a Stata command, use the help command to access the
very large internal documentation included in Stata. Even experienced users use help
pages on a daily basis. Learning to use Stata help pages is a course objective in itself.
Stata help command: http://www.stata.com/help.cgi?help
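For instance, any of the lines below, typed in the Command window, will open the
corresponding help page; the commands named here are merely examples:

. help summarize
. help tabulate
. h regress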
If you are looking for help on statistics, please first refer to the course readings listed in
the course syllabus. Feinstein and Thomas’ Making History Count (Cambridge Universi-
ty Press, 2002) is the main handbook for this course; help on graphics and other topics
can also be found in the additional readings.
If you are looking for help on statistical procedures in Stata, please first refer to the
course website for a selection of Stata tutorials. Two American universities, Princeton
and UCLA, have produced excellent Stata tutorials that cover material similar to the
course sessions. More tutorials are available online.
Course website: http://f.briatte.org/teaching/quanti/
If you are stuck, do not panic! Please first make sure that you have explored the soft-
ware and course resources listed above. It is safe to assert that 99% of Stata questions
for this course can be answered from the course material. If still stuck, try a Google
search on your question: thousands of online sources hold answers to identical ques-
tions asked by Stata users around the world. Researchers often check the Statalist and
the statistics section of the StackOverflow website for answers to their own questions.
Finally, if still stuck, and in this case only, email us to ask your questions directly. It
would be preferable if email correspondence could be limited to questions on your re-
search design, rather than questions that could be answered by simply reading the
course material mentioned above.
2.6. Commands
Stata can be used either through its Graphical User Interface (GUI), like most soft-
ware, or through a ‘command line’ terminal, which is a very common aspect of pro-
gramming environments. As explained just below, learning how to use the command
line and writing do-files are compulsory for replication purposes.
The ‘command line’ terminal works by entering lines of instructions that are repro-
duced, along with their results, in another window. The next sections of this guide ex-
plain further how Stata works and document the usual commands used in this course
for data management, description, analysis and graphing. A ‘cheat sheet’ for these
commands is offered in Section 12.
The screenshot below shows an example of such commands, typed manually in the
Command window (top); after running them by pressing Enter, their output appears in
the Results window (bottom).
Command line terminals work by entering commands, such as set mem 500m (which
assigns 500MB of computer memory in Stata 11). When you press Enter, Stata will try
to execute, or ‘run’, the command, which might occasionally take a little bit of time if
your data or command are computationally intensive.
If your command ran successfully, Stata will display its result:
Note that some commands produce ‘blank’ outputs in the Results window, i.e. the
command was successfully entered and executed, but there is no indication of its actual
result(s). In these cases, a simple ‘.’ (dot) line will appear in the Results window, to
show that Stata encountered no problem while executing the command, and that it is
ready to process another one.
If the command is not valid, which often happens due to typing errors or for other rea-
sons related to how Stata works, it will display an error or a warning. In that case, you
have to fix the issue, often by re-typing the command correctly (as in the summarizze
example below), or by checking the documentation to understand where you made a
mistake and how to fix it.
Stata commands are case-sensitive. Use only lowercase letters when typing commands.
Variables can come in both uppercase and lowercase letters, which you will have to
type in exactly to avoid errors. As a rule of thumb, when creating variables, use only
lowercase letters.
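The short sketch below illustrates the point, assuming a dataset that contains a variable
stored in lowercase as age and no variable named AGE:

su age    // runs as expected, because the variable is named 'age'
su AGE    // returns an error, because no variable named 'AGE' exists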
If you need to correct an invalid command or re-run a command that you have already
used earlier on, you can use the PageUp or Fn-UpArrow keys on your keyboard to
browse through the previous commands that you typed, which are also displayed in the
Review window.
. set mem 500m

Current memory allocation

                    current                                 memory usage
    settable          value    description                  (1M = 1024k)
    set maxvar         5000    max. variables allowed             2.105M
    set memory         500M    max. data space                  500.000M
    set matsize         400    max. RHS vars in models            1.254M
                                                                 503.359M

. summarize age

    Variable       Obs        Mean    Std. Dev.       Min        Max
         age     24291    46.81392    17.16638         18         84

. summarizze age
unrecognized command: summarizze
r(199);
Some commands can be abbreviated for quicker use. If you run the help summarize
command in Stata, the help window will tell you that the summarize command can be
shorthanded as su:
Abbreviations exist for most commands and come in handy especially with commands
such as tabulate (shorthand tab), describe (shorthand d) or even help (shorthand h).
They also work for options like the detail option for the summarize command:
Some commands have particular attributes. Comments, for example, are lines of ex-
planation that start with * or //. They are not executed, but are necessary to make your
do-files and logs understandable by others as well as by yourself. The first and third lines
in the example below are comments.
You will find many comments in the course do-files: use them to describe what you are
doing as thoroughly as necessary. You will be the first beneficiary of these comments
when you reopen your own code after some time.
. su age

    Variable       Obs        Mean    Std. Dev.       Min        Max
         age     24291    46.81392    17.16638         18         84

. su age, d

                           Age
      Percentiles      Smallest
 1%           18             18
 5%           21             18
10%           24             18       Obs                24291
25%           32             18       Sum of Wgt.        24291
50%           46                      Mean            46.81392
                        Largest       Std. Dev.       17.16638
75%           60             84
90%           71             84       Variance        294.6846
95%           77             84       Skewness         .234832
99%           83             84       Kurtosis         2.09877

. * Creating a variable for Body Mass Index (BMI).
. gen bmi = weight*703/height^2

. * Summary statistics for BMI.
. tabstat bmi, s(n mean sd min median max)

    variable       N      mean        sd       min       p50       max
         bmi   24291     27.27  5.134197  15.20329  26.57845  50.48837
Additional commands can be installed. Stata can learn to ‘understand’ new com-
mands through packages written by its users, most often academics with programming
skills. We will use the ssc install command at a few points in this guide to install some
of these packages. Installation with ssc install requires an Internet connection.
Right away, you should install the fre command in Stata by typing ssc install fre, as we
will use this command a lot to display frequencies. Other handy commands like catplot,
spineplot or tabout will be installed throughout the course, as in the example below,
which shows possible installation results:

. ssc install tabout
checking tabout consistency and verifying not already installed...
installing into /Users/fr/Library/Application Support/Stata/ado/stbplus/...
installation complete.

. ssc install fre
checking fre consistency and verifying not already installed...
all files already exist and are up to date.
2.7. Replication
In this guide, terms like commands, logs and do-files collectively designate an essential
aspect of quantitative methods: replication, i.e. providing others as well as yourself with
the means to replicate your analysis.
Replication requires that you keep your original files intact. The dataset that you will
use for your research project should be left unmodified in your course folder, and
should be provided along with your other files when handing in assignments.
Replication also requires the list of commands you used to edit your data, for instance
to drop observations or to recode variables, as well as the commands that you used to
analyse the data, such as tabstat, histogram and regress. The commands can all be
stored into a single text file, with one command appearing on each line: this structure is
common to computer scripts and programs.
A do-file is a text file that contains your commands and comments. A log file is a sepa-
rate text file that contains these commands, along with their results. The production of
do-files will be practiced in class, and additional documentation appears in many Stata
tutorials listed in the course material.
Replication files are a crucial aspect of programming. If you open the do-files for this
course in the Stata do-file editor or in any other Stata-capable editor, you will notice
that the files feature line numbers and a coloured syntax. These generic features are
built into most programming environments.
Learning to understand and write in programming languages takes time, and therefore
constitutes a particular skill. Writing do-files in Stata requires learning commands and
their syntax, exactly like languages require learning vocabulary and grammar. Just like
with languages, the learning curve also decreases once you already know one. Finally,
since programming can also reflect high or low writing skills, you should read the cod-
ing recommendations in Section 13.3 on coding before submitting your own work.
3. Stata
Our course uses a recent version of Stata, a common software choice in social science
disciplines. Using any statistical software requires some basic skills in file management
and programming. The following steps apply to virtually any Stata user, and should be
practised until you are familiar enough with them.
Once you start exploring the Stata interface, you will realize that most windows can be
hidden to concentrate on your commands, do-files and results, as below. We won’t use
any other element of the interface in this course, but feel free to explore Stata and use
other functionalities.
3.1. Command line
Stata has a graphic user interface (GUI) and a command line system. The latter is much
more versatile and teaches you the syntax used by Stata. More importantly, the com-
mand line forces you to plan your work with data management and analysis.
Because the commands entered through the command line can be recorded, i.e. stored
as logs (see below), it will enable you to maintain a record of your operations and to
store comments alongside them. This step is essential to stay on top of your own work, as well
as to share it with others, usually as do-files.
The Stata GUI can be used occasionally for routine operations that need not appear in
your do-files. Keyboard shortcuts also save some time, as with ‘File > Open…’ (Ctrl-O
in Stata for Windows) or ‘File > Change Working Directory…’ (Cmmd-Shift-J in Stata
for Macintosh; this document uses ‘Cmmd’ to designate the ⌘, a.k.a. ‘Command’,
‘Cmd’ or ‘Apple’ modifier key).
3.2. Memory
Stata 12 manages memory on its own, but older versions of Stata usually open with a very
small memory allocation for data. To safely open large files in Stata 11 or below, we
recommend you run the set mem 500m command to allocate it 500MB.
Very large datasets might require allocating more memory, using a different version of
Stata with enhanced capacities, or even switching from Stata to software with higher
computational power. This course will not require doing so.
3.3. Working directory
The working directory is the folder from which Stata will open and save files by default.
You will have to set the working directory every time you launch Stata. The path to
your working directory depends on your system (Section 2.1).
To learn what the current working directory is, use the pwd command. To set it to a new
location, type cd followed by the path to the desired folder. To list the contents of the
working directory, use the ls or dir command.
Your working directory should be the main SRQM folder for this course, which we
also call the ‘Teaching Pack’ because you will be required to download all course mate-
rial to it. Download the Teaching Pack from the course website and unzip it to an easily
accessible location, such as your Documents folder.
The example below reflects all directory commands for a user called ‘fr’ using Mac OS X
to change the Stata working directory from the user’s Desktop to the SRQM folder,
which was downloaded and moved to the Documents folder:

. pwd
/Users/fr/Documents

. cd ~/Documents/Teaching/SRQM/
/Users/fr/Documents/Teaching/SRQM

. ls, w
Admin/    Datasets/      Software/    readme.pdf*   website.url*
Course/   Replication/   emails.txt   readme.txt*

The quotes around the file path are optional in this example, but are compulsory if your
file path contains spaces. The ls command above was given the wide (shorthand w)
option to make its output simpler to understand.
If you are unsure what the path to your SRQM folder is, do not just ignore this step as
if it were optional. Select ‘File > Change Working Directory…’ in the Stata menus, and
from there, select your SRQM folder.
3.4. Open/Save
Stata can use the usual open/save routine that you are familiar with from using other
software. It can also open datasets and save them from the command line if you have
correctly set your working directory in the first place.
The example below shows how to download a Stata dataset from an online source and
then save it on disk. The use command with the clear option removes any previously
opened dataset from memory, and the save command with the replace option will
overwrite any pre-existing data:

. use http://f.briatte.org/teaching/quanti/data/trust.dta, clear

. save datasets/trust.dta, replace
file datasets/trust.dta saved
In this course, you will never have to save any data: instead, you should leave all da-
tasets intact and use do-files to transform them appropriately. This will ensure that your
work stays entirely replicable.
3.5. Log files
The log is a text file that, once opened with log using, will save every single command
you enter in Stata as well as its results. Systematically logging your work is good prac-
tice, even when you are just trying out a few things. Logs can be closed with the log
close command followed by the name of your log if it has one:
. log using example.log, name(example) replace

      name:  example
       log:  /Users/fr/Documents/Teaching/SRQM/example.log
  log type:  text
 opened on:  17 Feb 2012, 00:01:42

. use datasets/qog2011, clear

. * Count countries with 0% malaria risk in 1994.
. count if sa_mr==0
  76

. log close example
      name:  example
       log:  /Users/fr/Documents/Teaching/SRQM/example.log
  log type:  text
 closed on:  17 Feb 2012, 00:02:53
Comments will also be saved to the log file, which is particularly useful when you have
to read through your work again or share it with someone. In the example above, all
comments, commands and results were saved to the log file.
3.6. Do-files
Logs are useful to save every operation and result from a practice session. If you need
someone else to replicate your work, however, you just need to share the commands
you entered, along with the comments that you wrote to document your analysis. Files
that contain commands and comments are called do-files.
Writing do-files is a crucial aspect of this course. Without a do-file, your work will be
mostly incomprehensible to others, or at least impossible to reproduce. Your do-file
should include your comments, and it should run smoothly, without returning any er-
rors. You will discover that these steps require a lot of work, so start to program early.
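As an indication of what is expected, here is a minimal sketch of a do-file, reusing the
QOG dataset and the log commands shown elsewhere in this guide; the log filename is
hypothetical and does not follow any required naming scheme:

* Open a log file to record commands and results.
log using draft1.log, name(draft1) replace

* Load the dataset, leaving the original file intact.
use datasets/qog2011, clear

* Describe and summarize a variable of interest.
d sa_mr
su sa_mr, d

* Close the log file.
log close draft1

Such a file can then be run in full or line by line, using the shortcuts described below.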
To open a new do-file, use either the doedit command or the ‘File > New Do-file’ menu
(keyboard shortcut: Ctrl-N on Windows, Cmmd-N on Macintosh).
You should take inspiration from the do-files produced for the course to write up your
own do-file for your research project. All our do-files are available from the course web-
site. This course requires only basic programming skills, as illustrated by the do-files that
we run during our practice sessions. More sophisticated examples can be found online.
To execute (or ‘run’) a do-file, open it, select any number of lines, and press Ctrl-D in
Stata for Windows or Cmmd-Shift-D in Stata for Macintosh. You can also use either the
GUI icons on the top-right of the Do-file Editor window, or use the do or run com-
mands. Use the Ctrl-L (Windows) or Cmmd-L (Macintosh) keyboard shortcuts to select
the entire current line in order to run it.
Get some practice with do-files as soon as possible, since your coursework will include
replicating one do-file a week. Replicating is nothing more than reading through the
comments of a do-file, while running all its commands sequentially.
3.7. Shutdown
When you are done with your work, just quit Stata like you would quit any other pro-
gram. At that stage, any unsaved operation will be lost, so make sure that your do-file
contains all the commands that you might want to replicate.
To quit with the command line, use log close _all to tell Stata to close all logs, and then
type exit, clear to erase any data stored in Stata memory and quit. Alternatively, just
exit Stata like any other program to close logs and clear data automatically.
Remember not to save your data on exit (Section 3.4).
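A typical end-of-session sequence, typed in the Command window, would therefore
look like this:

. log close _all
. exit, clear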
3.8. Alternatives
This course uses Stata (by StataCorp) as its statistical software of choice. Stata is com-
monly used by social scientists working with quantitative data in areas such as econom-
ics and political science. It is a powerful solution that provides a good middle ground
between spreadsheet editors and R, which is the most powerful (and least expensive,
since it’s free and open source) but also the most difficult choice of statistical software.
Stata is also more advanced than SPSS because of its emphasis on programming, which
has led to the development of a large set of additional packages. Most statistical proce-
dures have some form of implementation in Stata, and the software is supported by a
large user community that meets on the Statalist mailing-list.
Stata has a few limitations. Its graphics engine is not bad, but not excellent either. It is
not as capable as SAS with large datasets, nor as focused on a particular approach to
quantitative analysis as EViews for econometrics. Finally, unlike free and open source
software, it is a commercial product.
Within these limitations, Stata remains an appropriate solution for the kind of proce-
dures that you will learn to use during this course. Its programming features, operated
through the command line, are central to the learning objectives of the course.
The Stata website will tell you more about the different versions of Stata. It also holds
an online version of the software documentation: http://www.stata.com/. The website
also links to Stata books, journals, and to the Statalist mailing list.
If you are planning to continue using quantitative methods during your degree, you
should also start learning more about R as soon as you are familiar with Stata. Alterna-
tives to Stata are documented in the course material.
4. Research
The course is built around small research projects, on which you will write your final
paper. Every student (grouped in pairs when applicable) is expected to participate,
which requires some basic knowledge of scientific reasoning.
Scientific research aims at establishing theories of particular knowledge items, such as
elections (political science), continents (geography), international trade (economics),
history, proteins, galaxies and so on. All these items are grounded in real events that
are partially processed through theoretical models of what they represent: competitions
of political elites within the structural constraints of partisan realignments, drifting tec-
tonic plates on top of the lithosphere, markets dominated by agents interested in mac-
roeconomic performance, representations of particular historical events, biological com-
pounds of amino acids, gravitational systems of stars… Our collective knowledge of
reality is directly mediated by these abstract conceptions.
Quantitative social science explores some particular phenomena of usually large scale,
in order to produce complex explanatory models that follow a common set of rules
with those cited above. More precisely, it looks for the regularities and mechanisms that
intervene in the distribution of social events such as military conflict, economic devel-
opment or democratic transitions, all of which tend to happen under particular condi-
tions at different points of space and time. The aims of quantitative social science con-
sist in building theories that simplify these conditions by pointing at the specific varia-
bles that might intervene in causing the events under scrutiny.
The final model used in this course, linear regression, offers one possible way of identi-
fying these variables, by looking at how a set of independent (explanatory) variables
can predict a fraction of another dependent (explained) variable. Can we understand,
for example, the spread of tuberculosis in a country by looking only at the different lev-
els of sanitation in a sample of the world? Is it the case that support for violent ac-
tion decreases with age and education? Are states more likely to be concerned by envi-
ronmental issues when they possess a high level of national wealth? Or is it rather the
case that their attention varies in function of their own exposure to, for instance, natu-
ral disasters?
Thousands of researchers spend their whole lives on similar questions. Several millions
of theories exist on all aspects of the real (natural, material) world.
4.1. Comparison
A fundamental motive behind theory building lies in comparison. Our units of observa-
tions, such as individuals or countries, express different characteristics that can be com-
pared with each other. Nation states, for instance, express various levels of authority
over their citizens, to the point where we can (or at least wish to) distinguish some po-
litical systems as democracies: structures of authority that are ultimately controlled by
citizens through means such as open elections. Likewise, some nation states go
through periods of acute political disruption that lead to social revolutions. Even more
fundamentally, some nation states hardly qualify for that title: the extent to which states
and nations coincide also varies from one country to another. These questions are funda-
mental issues in comparative politics (the selection of issues above comes from a course
by David Laitin at Stanford University). Similar research questions structure all other
fields of social science, from economic history to analytical sociology.
To understand the variety of political configurations in (geographical) space and in (his-
torical) time, social science researchers formulate arguments in which they posit explan-
atory factors, which we will call independent variables. Continuing with the examples
above, an early explanation of democracy is Montesquieu’s theory that climate influ-
ences political activity, and an early explanation of social revolution is Marx’s theory of
class structure. Both authors examined particular cases of democracies and revolutions,
and then derived a particular theory from their observations. Modern theories tackle the
same issues, but provide different explanations, using factors such as elections,
state/society interactions, or the precise timing of industrialization in each country un-
der examination.
Advances in social science consist in providing analytically more precise concepts and
typologies for the phenomena under study. Revolutions, for instance, are now studied
under several categories, which distinguish, for instance, “white” (non-violent) revolu-
tions from other ones. By doing so, researchers improve the specification of these phe-
nomena, which we will technically designate as our dependent variables. The deep
anatomy of these social phenomena nonetheless poses a constant challenge to scien-
tists, since before we can start understanding their causes, we need to define and con-
ceptualize complex phenomena such as “civil war”, “counter-insurgency”, “morality”
or “identity”.
The quantitative analysis of social phenomena cannot solve any of these issues, but it
can contribute to improving our knowledge of concept formation, theory building and
comparison across units of observation.
4.2. Theory
Formally, theory building starts with a certain knowledge of scientific advances in a giv-
en field. A certain amount of knowledge already exists, for example, on why young
mothers abort, or on how durable peace occurs and then persists between nation
states. Everything that you know from your previous courses in the social sciences will
be useful in thinking about your data, especially what you have learned in the fields of
demography, economics, public health or sociology. Once previous knowledge has
been considered, however, the only method of verification that exists for a particular
phenomenon is its observation.
Several methods of observation coexist. All of them, and not just quantitative methods,
are based on structured comparisons of different units of observation, be they pro-
testers in a public demonstration, voters in an election, young mothers in an abortion
clinic, national governments in a technological race, or random members of the public.
Observations are then produced either through experimental or through observational
studies, both of which provide a number of facts, such as a response rate to a question
or the adoption of a particular behaviour. These facts are collected in order to build
theories to explain why and how they occur.
When it is impossible to work on all instances of a phenomenon in the material world,
such as the development of cancer cells or the occurrence of revolutions, scientists fo-
cus their attention on carefully selected samples of observations and then generalise
their findings to a larger number of observations. Scientific theories therefore exist for
phenomena as diverse, as common and as important as the democratic election of ex-
treme-right parties, or the effect of radiation on the physiological status of human be-
ings.
These operations of theory building guide scientific inquiry. Additional principles con-
cern the rules under which we construct theoretical models. A crucial rule of science
consists in the suppression of all personal judgement over the data (objectivity), in or-
der to formulate statements that hold generally true rather than only towards a given
end (normativity). Social science is a branch of inquiry where these principles are partic-
ularly difficult to follow, but where they apply nonetheless, and where they allow to
formulate scientific statements on several aspects of social interaction, from suicide ter-
rorism to divorce, from increases in exports to changes in political leadership.
4.3. Quantitative social science
Quantitative approaches to social science apply the aforementioned scientific rules in
order to identify variations in events that involve a number of units such as people,
states, elections or civil wars, and that we describe through a certain number of charac-
teristics that vary from one unit to another: variables.
An example of a quantitative result relates to presidential approval: social surveys that
measure the extent to which people tend to support their presidents have found that
economic performance is often very influential in determining that support. Theoretical
models, such as David Easton’s systemic theory of political inputs and outputs, support
that kind of finding. Similarly, health expenditure has been measured in Western
countries for several decades. Variations in health spending seem easily explained by
variations in life expectancy, but also by the increasing costs caused by improvements
in medical technology. Current data help explain that phenomenon: health ex-
penditure growth does not directly depend on the age structure of a country and on
the longevity of its residents, but rather on the health status and behaviour of its indi-
viduals, which themselves happen to vary with age.
Variables appear in these results and in their explanatory theories. Economic perfor-
mance, for instance, is a variable often measured through unemployment levels, gross
domestic product, public deficits and annual changes in per capita disposable income.
Similarly, health expenditure and health behaviour are also measured through com-
plex computations of health services supply and demand. The processes and mecha-
nisms that causally connect these variables come from quantitative, qualitative and also
from theoretical research, in order to provide causal efficacy to the correlations that we
observe.
There are multiple sources of error in that process. One of the most important comes
from the measurement of our variables: a survey question can contain unwanted incen-
tives to answer in a particular way, or it can simply be confusing and misleading, or the
answer to a question can be misinterpreted. The careful creation of concepts for com-
plex phenomena such as racism, political identity or illness solves part of that issue.
Valid and reliable data are then used to test particular hypotheses, such as the pre-
sumption that education and xenophobia are negatively correlated, or that economic
growth is proportionate to the openness of national economies to all possible competi-
tors. Quantitative social science verifies, or nullifies, these kinds of hypotheses, based
on various sources of data, or statistics.
4.4. Social statistics
Quantitative data come in the form of datasets, which themselves are numeric collec-
tions of variables for a given set of units of observation. The example below is taken
from the U.S. National Health Interview Survey (NHIS):
The rows hold observations: each row of numeric data designates the answers
of one individual respondent (the unit of observation).
The columns hold variables: each column corresponds to a particular
question, such as gender, earnings, health status and so on.
In this example, some variables can be ordered: health, for instance, is based on a self-
reported measure that ranges from “poor” to “excellent”. Other variables take values
that cannot be ordered: raceb, for instance, corresponds to the respondent’s racial-
ethnic profile, for which there is no ordering. Other variables have only two possible
values, such as sex (either male or female) or insurance status (either covered or not).
These are examples of different types of variables that come in addition to purely ‘nu-
meric’ ones like weight, measured in pounds.
Units of observation are not necessarily individuals: they can be anything from organi-
zations to historical events (Section 5.1). Sometimes, not all variables can be measured
for all observations: there will be missing values. The example below is taken from the
Quality of Government (QOG) dataset:
The Quality of Government dataset uses countries as its unit of analysis. Due to several
difficulties with data collection and measurement, it shows an important number of
missing observations for several variables: the “bl_asyt15” and “bl_asyt25” variables,
for instance, measure the average number of schooling years among the population,
and have a high number of missing values. There are also some missing values in the
column for the iaep_es variable, which holds the legislative electoral system for each
observation.
The combined effect of sampling and missing observations forces us to work on a finite
number of observations, which introduces a further risk of error when we start analys-
ing the data. Furthermore, we will use more than one model, as the type of variables
under examination calls for different statistical procedures. Similarly, the number of var-
iables influences these procedures:
The distribution of one variable, such as the number of democracies observed
at a given point in time or the proportions of each religious group in a given
population in 2004, is captured by univariate statistics. These statistics allow us to
calculate to what extent our sample might be different with respect to the uni-
verse of data that we are sampling from, such as the whole population of a
country, all countries or all instances of civil war. This standard error will appear
in all our statistical procedures.
The relationship between two variables, such as racism and income or national
wealth and defence spending, is addressed by bivariate tests. These tests pro-
vide the probability that a relationship observed within our sample could be
caused by mere chance. This crucial statistic is called the p-value: only when it
stays under a certain level of significance will we accept that an observed rela-
tionship is statistically significant.
The relationship between two or more variables can also be modelled into an
equation, such as productivity = α · technology + β · education. Formally, mod-
els include an error term ε in the equation, to account for the sampling error
previously mentioned. Like bivariate tests, models also come with a p-
value for us to decide whether they can be confidently followed or not.
These three types of procedures are the essential building blocks of the course, and of
quantitative analysis in general. Because they require thinking about so many different
factors at the same time, they also require thinking about data and analysis in a particu-
lar manner: statistical reasoning, the primary teaching objective of this course.
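As a preview, and assuming a continuous dependent variable dv with predictors iv1 and
iv2 (placeholder names, not variables from any actual dataset), each level of analysis
corresponds to a different Stata command:

* Univariate statistics: the distribution of a single variable.
su dv, d
* Bivariate test: the correlation between two variables, with its p-value.
pwcorr dv iv1, sig
* Model: a linear regression of dv on several independent variables.
reg dv iv1 iv2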
More details on the statistical operations covered in the course appear in the course syl-
labus, which you should read before reading the next section on data. You should also
read a few pages of quantitative social science before going further, to make sure
that you understand the kind of research that you will be learning to perform, using
some introductory procedures.
4.5. Readings
Depending on your experience with quantitative analysis and on your general themes
of interest, you should read at least four of the texts below, after making your own se-
lection based on personal interests. If you are not familiar with political science, you will
want to include Charles Cameron’s presentation of quantitative analysis in that disci-
pline.
Some additional recommendations apply:
Do not try to understand in full the methods used by the authors: concentrate
on the style of writing and reasoning instead, as well as on the particular form
of research question that quantitative researchers examine in different disci-
plines.
If you have little experience with either quantitative analysis or with scientific
writing, you will need more and not less from that list. Actually, you should
read the full list if you have no experience with either, and stop only when you
feel familiar enough with the material.
The reading of these texts is unmonitored, and left entirely up to you to
organise. You might want to read at least two texts in the first two weeks, then
one more before writing up each assignment, and finally one last before writing
your final paper.
If you are selecting political science as your major interest for the readings, start with
the reading by Cameron, then read either one of the Gelman et al. texts or the Bartels
one, and then read either Jordan or Tavits.
Larry M. Bartels, Unequal Democracy. The Political Economy of the New Gild-
ed Age, Princeton University Press, 2008, chapter 5.
In this chapter I explore four important facets of Americans’ views about equali-
ty. First, I examine public support for broad egalitarian values, and the social ba-
ses and political consequences of that support. Second, I examine public atti-
tudes toward salient economic groups, including rich people, poor people, big
business, and labor unions, among others. As with more abstract support for
egalitarian values, I investigate variation in attitudes toward these groups and
the political implications of that variation. Third, I examine public perceptions of
inequality and opportunity, including perceptions of growing economic inequali-
ty, normative assessments of that trend, and explanations for disparities in eco-
nomic status. Finally, I examine how public perceptions of inequality, its causes
and consequences, and its normative implications are shaped by the interaction
of political information and political ideology.
Charles Cameron, “What is Political Science?” in Andrew Gelman and Jeronimo
Cortina, A Quantitative Tour of the Social Sciences, Cambridge University
Press, 2009, chapter 15.
Politics is part of virtually any social interaction involving cooperation or conflict,
thus including interactions within private organizations (“office politics”) along
with larger political conflicts. Given the potentially huge domain of politics, it’s
perfectly possible to talk about “the politics of X,” where X can be anything
ranging from table manners to animal “societies.” But although all of these are
studied by political scientists to some extent, in the American academy “political
science” generally means the study of a rather circumscribed range of social
phenomena falling within four distinct and professionalized fields: American pol-
itics, comparative politics, international relations, and political theory (that is,
political philosophy).
Ashley M. Fox, “The Social Determinants of HIV Serostatus in Sub-Saharan Af-
rica: An Inverse Relationship Between Poverty and HIV?”, Public Health Reports
125(s4), 2010.
Contrary to theories that poverty acts as an underlying driver of human immu-
nodeficiency virus (HIV) infection in sub-Saharan Africa (SSA), an increasing
body of evidence at the national and individual levels indicates that wealthier
countries, and wealthier individuals within countries, are at heightened risk for
HIV. This article reviews the literature on what has increasingly become known
as the positive-wealth gradient in HIV infection in SSA, or the counterintuitive
finding that the poor do not have higher rates of HIV. This article also discusses
the programmatic and theoretical implications of the positive HIV-wealth gradi-
ent for traditional behavioral interventions and the social determinants of health
literature, and concludes by proposing that economic and social policies be lev-
eraged as structural interventions to prevent HIV in SSA.
Andrew Gelman et al., Red State, Blue State, Rich State, Poor State. Why Amer-
icans Vote the Way They Do, Princeton University Press, 2008, chapter 2.
!
28
This book [chapter] was ultimately motivated by frustration at media images of
rich, yuppie Democrats and lower-income, middle-American Republicans:
archetypes that ring true, at some level, but are contradicted in the aggregate.
Journalists are, we can assume, more informed than typical voters. When the
news media repeatedly make a specific mistake, it is worth looking at. The per-
ception of polarization is itself a part of polarization, and views about whom the
candidates represent can affect how political decisions are reported. And, as we
explore exhaustively, the red-blue culture war does seem to appear in voting
patterns, but at the high end of income, not the low, with educated profession-
als moving toward the Democrats and managers and business owners moving
toward the Republicans.
David Karol and Edward Miguel, “The Electoral Cost of War: Iraq Casualties and
the 2004 U.S. Presidential Election”, Journal of Politics 69(3), 2007.
Many contend that President Bush’s reelection and increased vote share in 2004
prove that the Iraq War was either electorally irrelevant or aided him. We pre-
sent contrary evidence. Focusing on the change in Bush’s 2004 showing com-
pared to 2000, we discover that Iraq casualties from a state significantly de-
pressed the President’s vote share there. We infer that were it not for the ap-
proximately 10,000 U.S. dead and wounded by Election Day, Bush would have
won nearly 2% more of the national popular vote, carrying several additional
states and winning decisively. Such a result would have been close to forecasts
based on models that did not include war impacts. Casualty effects are largest in
“blue” states. In contrast, National Guard/Reservist call-ups had no impact be-
yond the main casualty effect. We discuss implications for both the election
modeling enterprise and the debate over the “casualty sensitivity” of the U.S.
public.
Rachel Margolis and Mikko Myrskylä, “A Global Perspective on Happiness and
Fertility”, Population and Development Review 37(1), 2011.
The literature on fertility and happiness has neglected comparative analysis. We
investigate the fertility/happiness association using data from the World Values
Surveys for 86 countries. We find that, globally, happiness decreases with the
number of children. This association, however, is strongly modified by individual
and contextual factors. Most importantly, we find that the association between
happiness and fertility evolves from negative to neutral to positive above age
40, and is strongest among those who are likely to benefit most from upward
intergenerational transfers. In addition, analyses by welfare regime show that
the negative fertility/happiness association for younger adults is weakest in
countries with high public support for families, and the positive association
above age 40 is strongest in countries where old-age support depends mostly
on the family. Overall these results suggest that children are a long-term invest-
ment in well-being, and highlight the importance of the life-cycle stage and
contextual factors in explaining the happiness/fertility association.
Patrick Sturgis and Patten Smith, “Fictitious Issues Revisited: Political Interest,
Knowledge and the Generation of Nonattitudes”, Political Studies 58(1), 2010.
It has long been suspected that, when asked to provide opinions on matters of
public policy, significant numbers of those surveyed do so with only the vaguest
understanding of the issues in question. In this article, we present the results of
a study which demonstrates that a significant minority of the British public are,
in fact, willing to provide evaluations of non-existent policy issues. In contrast to
previous American research, which has found such responses to be most preva-
lent among the less educated, we find that the tendency to provide ‘pseudo-
opinions’ is positively correlated with self-reported interest in politics. This effect
is itself moderated by the context in which the political interest item is adminis-
tered; when this question precedes the fictitious issue item, its effect is greater
than when this order is reversed. Political knowledge, on the other hand, is as-
sociated with a lower probability of providing pseudo-opinions, though this ef-
fect is weaker than that observed for political interest. Our results support the
view that responses to fictitious issue items are not generated at random, via
some ‘mental coin flip’. Instead, respondents actively seek out what they con-
sider to be the likely meaning of the question and then respond in their own
terms, through the filter of partisan loyalties and current political discourses.
Part 1
Data
Quantitative data is a particular form of data that simplifies information into variables
that take different types and values. Some variables, such as gross domestic product or
monthly income, hold continuous data that are strictly numeric, while others hold cate-
gorical data, such as social class for individuals or political regime type for nation states.
The collection of data for quantitative analysis systematically creates issues of meas-
urement and reliability that also apply to qualitative research. These issues are usually
explored by classes that focus on social surveys and research design. In this course, we
assume that you already know of some of the issues that apply to data collection.
Manipulating a dataset is a complex task that requires some familiarity with the struc-
ture of the data, with the software commands available to prepare the data, and with
the research design in which your analysis will take place. Using predefined datasets, as
we will in this course, will simplify these operations a great deal, but will not entirely
suppress them.
This section describes the essential steps that you should follow to prepare your data
before starting your analysis. The four sections are better read as just one block of the
guide, as they frequently overlap. If you are using a pre-assembled Stata dataset that
comes in cross-sectional format, skip Sections 7.1 to 7.3.
5. Structure
In quantitative environments, information is stored in datasets that hold observations
and variables. Understanding the structure of your data is an absolute requirement to
its analysis, for the following reasons:
The studies that motivate data collection have different goals. This course will
cover only observational studies using cross-sectional data (Section 5.1).
The observations contained in a dataset generally consist in a sample taken
from a larger population. The representativeness of your data depends on how
that sample was initially constructed (Section 5.2).
The variables of a dataset consist of numerical, text or missing values assigned
to each observation, following a consistent level of measurement (Section 5.3).
This section briefly reviews each of these aspects.
5.1. Studies
All quantitative studies use samples, variables and values, but distinctions apply among
them given the wider research strategy for which the data were collected. The principal
issue at stake is the type of randomization employed in the study:
Experimental studies designate research designs where the observer is able to
interact with the subjects or patients that compose the sample. Experimental
settings are common in psychology and clinical studies, where subjects or pa-
tients are often randomly assigned to a ‘treatment’ and a ‘control’ group to
study the effects of a particular drug or setting on them. These studies generally
rely on small samples and on an analysis of variance (ANOVA).
Observational studies designate research designs where the observer is not able
to interact with the sample. The randomized component does not have to do
with assigning treatments but with randomly sampling observations, which are
most often individuals from a given population. Such studies are extremely
common in research that focuses on social and political ‘treatments’ (such as
environmentalism or drug addiction) that cannot be assigned to subjects.
This course explores non-experimental data collected in observational studies. A further
distinction applies between these studies, depending on the period of observation for
which the data were collected:
Cross-sectional studies are collected at one particular point in time and provide
‘snapshots’ of data in a given period, such as political attitudes in the American
population a few days after September 2001, or health expenditure levels in EU
member states in 2010-2011.
Time series are collected at repeated points in time. In the case of cross-
sectional time series (CSTS), a different sample is collected at each point. If the
same sample is used throughout, the study provides longitudinal information
on a given ‘panel’ or ‘cohort’, such as U.S. households or OECD countries.
This course will focus on cross-sectional data, which are the most readily available be-
cause of the sunk costs of collecting longitudinal data. Many common forms of surveys,
such as opinion polls, are cross-sectional, although larger research surveys often have a
panel component that involves following a group of individuals over several years.
Cross-sectional data has its own statistical limits. Although it allows comparing across
observations, it provides no information on the changes that occur through time within
and between the units of analysis. That information appears in longitudinal data, which
require additional statistical methods outside of the scope of this course.
5.2. Sampling
An observation is one single instance of the unit of analysis. The unit of analysis is a
unique entity for which the data were collected, and can be virtually anything as long
as a clear definition exists for it. Voters, countries or companies are common units of
analysis, but events like natural disasters and civil wars are also potential candidates.
The definition of the unit of analysis sets the population from which to sample.
For instance, if you are studying voting behaviour in France, your study is likely to apply
only to the French adult population that was allowed to vote at the time you conducted
your research. Dataset codebooks usually discuss these issues at length.
The sample design then sets how observations were collected. The various techniques
that apply to sampling form a crucial component of quantitative methods, which can be
broken down into a few essential elements that you will need to understand in order to
assess the representativeness of your data:
The sample size designates the number of observations, noted N, contained in
the dataset. Since variables often have missing values, large segments of your
analysis might run on a lower number of observations than N (Section 6.3).
Sample size affects statistical significance through sampling error, which charac-
terises the difference between a sample parameter, such as the average level of
support for Barack Obama in a sample of N respondents, and a population pa-
rameter, such as the actual average level of support for Barack Obama in the
full population of U.S. voters (the size of which we might know, or not).
The Central Limit Theorem (CLT) shows that repeated sample means are nor-
mally distributed around the population mean.
Sampling error is calculated using the standard error, from which are derived
confidence intervals for parameters such as the sample mean. The standard er-
ror decreases with the square root of the sample size (the width of a confidence
interval, in turn, depends on both the standard error and the chosen level of
confidence). Consequently, the law of large numbers applies:
larger sample sizes will approach population parameters better and are prefera-
ble to obtain robust findings.
The sampling strategy designates the method used to collect the units con-
tained within the sample (i.e. the dataset) from a larger universe of units, which
can be a reference population such as adult residents in the United States or all
nation-states worldwide at a given point in time.
Sampling strategy has an impact on representativeness. Surveys often try to
achieve simple random sampling to select observations from a population, us-
ing a method of data collection designed to assign each member of the popula-
tion an equal probability of being selected, so that the results of the survey can
be generalised to the population.
Random or systematic sampling from particular strata or clusters of the popula-
tion are among the methods used by researchers to approximate that type of
representativeness. These methods can only approximate the whole universe of
cases, as when a study ends up containing a higher proportion of old women
than the true population actually does, which is why observations in a sample
will be weighted in order to better match the sample with its population of ref-
erence.
Other methods of data collection rely on nonprobability sampling. When the
unit of analysis exists in a small universe, such as states or stock market compa-
nies, or when the study is aimed at a particular population, such as Internet us-
ers or voters, the sampling strategy targets specific units of analysis, with results
that are not necessarily generalizable outside of the sample.
The sampling strategy can correct for design effects such as clustering, system-
atic noncoverage and selection bias, all of which negatively affect the repre-
sentativeness of the sample. Representativeness can be obtained through care-
ful research design and weighted sampling. Stata handles complex survey de-
sign with several weights options passed to the svyset and svy: commands,
both covered at length in the Stata documentation.
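As a minimal, hedged illustration of that pair of commands (dweight and agea are ESS
variable names used in the applications below; the choice of estimating a mean is ours):

* Declare a survey design that only uses a probability weight,
* then estimate a weighted mean with the svy: prefix.
svyset [pweight = dweight]
svy: mean agea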
Important: Neither sample size nor sampling strategy will remove measurement errors
that occur at earlier or later stages of data collection. Representativeness is only one
aspect of survey design. On the one hand, it is technically possible to collect repre-
sentative answers to very poorly written survey questions that will ultimately measure
nothing. Ambiguously worded questions, for instance, will trigger unreliable answers
that will cloud the results regardless of the statistical power and representativeness car-
ried by the sample size and strategy. There is no statistical solution to the bias induced
by question wording and order. On the other hand, coding and measurement errors
can reduce the quality of the data in any sample: again, representativeness does not
control for such issues. The “garbage in, garbage out” principle applies: poorly de-
signed studies will always yield poor results, if any.
Application 5a. Weighting a cross-national survey Data: ESS 2008
The European Social Survey (ESS) contains a design weight variable (dweight) to ac-
count for the fact that some categories of the population are over-represented in its
sample. The table below was obtained by selecting a few observations from the study,
using the sample command with the count option to draw a random subsample of 10
observations; the list command was then used to display the country of residence,
gender and age of each respondent in this subsample, along with the design and popu-
lation weights.
The ESS documentation describes dweight (design weight) as follows:
Several of the sample designs used by countries participating in the ESS were
not able to give all individuals in the population aged 15+ precisely the same
chance of selection. Thus, for instance, the unweighted samples in some coun-
tries over- or under-represent people in certain types of address or household,
such as those in larger households. The design weight corrects for these slightly
different probabilities of selection, thereby making the sample more representa-
tive of a ‘true’ sample of individuals aged 15+ in each country.
. sample 10, count
(51132 observations deleted)

. list cntry gndr agea dweight pweight

       cntry   gndr   agea   dweight   pweight
  1.      UA      M     18    1.3027      2.15
  2.      CH      F     51    1.0826      0.35
  3.      CH      M     52    1.0826      0.35
  4.      NL      M     24    1.0099      0.76
  5.      RU      F     25    0.4878      4.82
  6.      RU      F     19    2.1556      4.82
  7.      CY      F     47    0.7652      0.05
  8.      GB      M     54    1.0141      2.14
  9.      CH      M     31    1.0835      0.35
 10.      RU      M     41    1.1936      4.82

By looking at the subsample listed above, you can spot two observations for which the
dweight variable is below 1: both are females who were drawn from households
that are over-represented by the sampling strategy used by the ESS, and are therefore
assigned design weights under 1. Conversely, the other Russian female, aged 19, was
drawn from a very under-represented household, and her assigned design weight is
therefore above 2. These weights, when used with the [weight] operator or svyset
command, ensure that these observations are given more or less importance when us-
ing frequencies and other aspects of the data, so as to compensate for their under- or
over-representation in the ESS sample in comparison to the actual population from
which they were drawn.
The ESS documentation describes pweight (population weight) as follows:
This weight corrects for the fact that most countries taking part in the ESS have
very similar sample sizes, no matter how large or small their population. With-
out weighting, any figures combining two or more country’s data would be in-
correct, over-representing smaller countries at the expense of larger ones. So
the Population size weight makes an adjustment to ensure that each country is
represented in proportion to its population size.
By looking again at the data above, you can indeed observe that the two respondents
from Ukraine and Britain, two countries with large populations, have population
weights around 2, and that the three respondents from Russia have an even higher
population weight, whereas small countries like Cyprus or Switzerland have much
smaller values. This makes sure that, when calculating the frequencies of a variable over
several countries (such as the percentage of right-wing voters in Europe), the actual
population size of each country is taken into account.
In conclusion, to weight the data at the European level, we need to account for both
design and population weights. Design weights correct for the over- and under-
representation of some socio-demographic groups, and population weights make sure
that each national population accounts for its fraction of the overall European popula-
tion.
This is done with the svyset command by creating a multiplication of both weights and
using it as a probability weight:
. gen wgt=dweight*pweight

. svyset [pw=wgt]

      pweight: wgt
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

While population and design weights are pretty straightforward in this example, sur-
veys can reach high levels of complexity when researchers try to capture multistage
contexts by sampling from several strata and clusters of the target population. For ex-
ample, large demographic surveys will often sample cities, and then sample neighbour-
hoods within them, and then sample households within them, and finally sample adults
within them.
Application 5b. Weighting a multistage probability sample Data: NHIS 2009
The National Health Interview Survey (NHIS) is a good example of “a complex, multi-
stage probability sample that incorporates stratification, clustering, and oversampling of
some subpopulations” for some of its available years of data. It would take too much
space to document the study fully, but the most basic weight, perweight, provides a
good example of how weights are constructed:
This weight should be used for analyses at the person level, for variables in
which information was collected on all persons. [The weight] represents the in-
verse probability of selection into the sample, adjusted for non-response with
post-stratification adjustments for age, race/ethnicity, and sex using the [U.S.]
Census Bureau's population control totals. For each year, the sum of these
weights is equal to that year’s civilian, non-institutionalized U.S. population.
More documentation from the NHIS then introduces strata, which represents the im-
pact of the sample design stratification on the estimates of variance and standard er-
rors,” and psu, which represents the impact of the sample design clustering on the
estimates of variance and standard errors.” Both parameters would require a course on
survey design to be fully explained. In the meantime, they can be passed to the svyset command
along with perweight, the sampling weight:
. svyset psu [pw=perweight], strata(strata)

      pweight: perweight
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>

5.3. Variables

A variable is any measurement that can be described using more than one numeric
value. The values must therefore vary across observations for the measurement to
qualify as a variable.

Each variable is defined by a range of possible values. At the most basic level, some var-
iables are considered quantitative because their values can be ordered meaningfully into
levels, and some variables are considered qualitative because there is no substantively
significant ordering of their values. This distinction is rather imprecise and relatively mis-
leading, which is why we will use a more advanced classification of variables below.

The level of measurement used by each variable in your dataset is the very first thing
that you need to understand about the data before analysing it:
A nominal scale qualifies a variable that was measured using discrete categories
that cannot be objectively ordered. Examples of nominal variables are religious
beliefs and legal systems: there is no objective ordering of “Jewish” and “Mus-
lim”, and “English Common Law” is a category distinct from “French Commer-
cial Code”.
A specific nominal scale uses dichotomous categories, which result in binary
variables that can take only 0 or 1 as values. Examples of binary variables are
sex and democracy: an individual person is either female (1) or not (0), and a
political regime is either democratic (1) or not (0). The discrete values denote a
nominal difference, not an objective order.
An ordinal scale qualifies a variable that was measured using categories that can
be ordered regardless of their distance. Examples of ordinal variables are educa-
tional attainment and internal conflict: ‘primary school’, ‘secondary school’ and
‘university degree’ can be ordered, just like ‘low’, ‘medium’ and ‘high’ internal
conflict.
Ordinal scales do not reflect meaningful distances between their categories. For
example, the difference in educational attainment between ‘primary school’ and
‘secondary school’ is not equal to the difference in educational attainment be-
tween ‘secondary school’ and ‘university degree’. The interval between the cat-
egories is variable.
An interval scale qualifies a variable that was measured using categories that
can be ordered at equal distance. Examples of interval variables are age groups
and approximate indexes: the same distance exists between the “15-19”, “20-
24” and “25-29” age groups, as well as between each category of Transparen-
cy International’s Corruption Perceptions Index, which ranges from 0 (highly
corrupt) to 10 (highly clean).
Interval variables do not have an absolute zero, insofar as the first level of the
interval is relative and does not designate a meaningful zero point. For example,
being in the lowest age group does not literally signify being 0-year-old, just as
being in the highest category of the Corruption Perceptions Index does not indi-
cate that the level of corruption is 0%.
A ratio scale qualifies a variable that was measured against a numeric scale with
an absolute zero point. Examples of ratio variables are income and inflation: ei-
ther of them can take 0 as a substantive value, indicating the absence of income
and a 0% inflation rate respectively. Variables of this type might be thought of
as ‘purely’ continuous.
The level of measurement is a determinant aspect in statistical models, as it will deter-
mine how to describe, analyse and interpret each variable. The type of the dependent
variable is particularly important, and a simpler classification applies:
Continuous variables hold values for which we can calculate counts or ratios.
Examples of continuous variables are number of children and economic growth:
an individual can have any number of children, and a state can experience any
percentage of economic growth. In both cases, we can meaningfully compare
the values across observations.
Some distinctions apply within continuous data: count data holds positive inte-
ger values, as applies to the number of children, since it is impossible to have
‘7.5 children’ or ‘-3 children’ but only { 1, 2, 3… n } children. Continuous varia-
bles like economic growth can take virtually any value, from -∞ to +∞, even
though they empirically exist in a more restricted range.
Categorical variables hold values for which we can only observe discrete cate-
gories. In statistical modelling, however, any categorical variable with an ordinal
or interval scale can be treated as pseudo-continuous, and the categorical classi-
fication will finally apply only to nominal variables. This course will often refer to
continuous data in this looser sense to include ordinal and interval variables.
The dependent variable in your research project should ideally be ‘purely’ con-
tinuous (ratio), but ordinal, interval and count variables are also possible candi-
dates, since linear regression will also function for these types of data. More ad-
vanced models exist to better handle categorical data, but they are beyond the
scope of this course.
Further instructions apply to variable manipulation, as you will often be required to
modify the variables in your dataset. These are described in Section 8, but the “garbage
in, garbage out” principle still applies: poorly designed studies cannot be rescued by
good data manipulation.
5.4. Values
The “number-crunching” aspect of quantitative research methods is due to the fact
that all the information considered by the analyst will come in the form of numbers ra-
ther than text (called “strings” in computer environments). Numeric values in a dataset
can point to three different kinds of data:
Continuous data are stored in numeric values, using integer and float formats.
The unit of measurement for the values, such as years for a variable describing
age or percentage points for a variable describing gross domestic product, is
not stored with the values but is often indicated in the variable label.
When necessary, as with cross-tabulations (Section 10), continuous data can
easily be turned into categorical data using the recode command. Variables such
as age or income, for example, can be better crosstabulated in the form of age
or income groups. When performing linear regression (Section 11), however,
continuous data are more appropriate.
Categorical data are also stored in numeric values, to which labels are assigned.
For example, a possible variable for gender will take the value 1 for males
and 2 for females, although a better encoding would consist in using a binary
variable called female that would code 0 for males and 1 for females.
The recode command allows creating new variables by assigning new value la-
bels to the data, based on existing ones. For example, if your data contains a
variable measured on an ordinal scale of ten categories from 1 ‘Strongly agree’
to 10 ‘Strongly disagree’, this scale can be recoded into a more simple scale of
five categories, or even into a binary variable (see the sketch after this list).
Missing data are observations for which the variable takes no value due to an
issue that arose at the level of data collection, such as respondents being unable
(or refusing) to answer, or insufficient information to measure the value. Stata
identifies missing data with the “.” character.
When missing data are not coded by “.” but by numeric values, such as “-1” or
“999” for variables that cannot take these values, you will have to use the
replace command in Stata to change this coding to the “.” coding, as illustrated in
Section 8.2.
Missing values will actually require a lot of attention at all points of your analy-
sis, in order to avoid all sorts of calculation and interpretation errors when look-
ing at frequencies and crosstabulations. Furthermore, missing values will con-
strain the number of observations available for correlation and regression analy-
sis.
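As a minimal, hedged sketch of both operations (the variable names agree10 and income
are hypothetical, as are the numeric codes assumed to stand for missing data):

* Collapse a 1-10 agreement scale into a binary indicator, with value labels.
recode agree10 (1/5 = 1 "Agrees") (6/10 = 0 "Disagrees"), gen(agrees)
* Recode numeric codes that actually stand for missing data into Stata's "." value.
replace income = . if income == -1 | income == 999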
As suggested by this description of values, a very important part of quantitative analysis
consists in learning about the exact coding of the data in order to better manipulate
variables later on. The practical aspect of that task (modifying values and labels) is cov-
ered in more detail in Section 8.
The substantive aspect of that task relies entirely on the analyst. Reading from the da-
taset codebook is essential to understand how the values of each variable were ob-
tained. For example, issues of measurement and reliability will inevitably exist with ag-
gregate indices, self-reported data and psychometric scales.
Application 5c. Reading frequencies Data: ESS 2008
The third-party fre command displays frequencies for a given variable in a better way
than the built-in tab command does. After installing the command, we look at attitudes
towards immigration from outside Europe in a sample of European respondents:
. fre impcntr [aw=dweight*pweight]

impcntr    Allow many/few immigrants from poorer countries outside Europe

                                             Freq.   Percent    Valid     Cum.
Valid    1 Allow many to come and live    6090.343     11.91    12.68    12.68
           here
         2 Allow some                     16460.09     32.19    34.26    46.94
         3 Allow a few                    15799.46     30.89    32.89    79.83
         4 Allow none                     9688.713     18.94    20.17   100.00
         Total                            48038.61     93.93   100.00
Missing  .a                               140.6597      0.28
         .b                               2938.206      5.75
         .c                               24.52225      0.05
         Total                            3103.388      6.07
Total                                        51142    100.00

The observations are weighted, which explains why the frequencies are not in-
tegers but still sum up to N = 51142.
We used the Stata [aw] suffix, which allows the use of weights like the design
and population weights in the ESS (Application 5a). Since we are looking at the
whole sample of European respondents, we used both weights, as recommend-
ed in the ESS documentation.
The variable under examination is an ordinal one, with four categories that can
be ordered by their degree of tolerance towards immigrants.
If we wanted to isolate the last categories, ‘Allow few/none’, in order to focus
on respondents who are most resistant to immigration, we could recode the var-
iable as a binary one, using a dichotomous separation between ‘Allow
some/many’ and ‘Allow few/none’ (a sketch of that recoding follows below).
There are missing values, coded as .a, .b and .c. These are variations of the .
Stata format for missing data (examined again in Section 8).
The difference in letters is used to code for different types of missing data: as
the ESS documentation explains, .a codes for respondents who refused to an-
swer, .b codes for respondents who did not know what to answer (often called
‘DK’ or ‘DNK’), and .c codes for respondents who did not answer (‘NA’).
Detailed missing values can hint at why there is missing data in the first place:
here, most of the 6% missing data come from respondents who declared not
having, or not being able to form, an opinion on the topic, rather than from re-
spondents refusing to answer, which is common when the question touches
upon a topic affected by desirability bias (i.e. when some answers are more pos-
itively or negatively connoted than others).
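The following lines are one possible implementation of that recoding; the cut-point
between the two groups and the name of the new variable are our own choices, not
prescribed by the ESS documentation:

* Collapse the four ordered categories of impcntr into a binary variable.
recode impcntr (1/2 = 0 "Allow some/many") (3/4 = 1 "Allow few/none"), gen(allowfew)
* Check the result, using the same combined weights as above.
fre allowfew [aw=dweight*pweight]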
A serious issue in scientific inquiry is overreliance on a limited number of data sources
and methods of study. If you are willing to spend some time exploring around, then
quantitative analysis will expand your abilities on both counts. Your skills with data and
methods are not just part of your academic curriculum: if you care enough to maintain
your level of knowledge in that area, they will stay with you all your life. My experience
with these skills shows that they have both personal and professional value.
The first step in acquiring those skills consists in training yourself to work with quan-
titative data. As with most activities, there is no substitute for training: your familiarity
with quantitative data and methods primarily reflect the time you spent on them. With
a few key terms of interest in mind, you should therefore start exploring data as soon
as you can. At its most basic, this implies accessing and downloading a selection of da-
tasets onto your computer.
Prior to opening the datasets, take a look at their documentation files. Do not aim at
understanding every single aspect of the documentation: focus on survey design and
sampling, which should be fully documented in the data codebook. The next sections
will then guide you through data exploration and management: Section 6 explains how
to explore a dataset, Section 7 explains how to prepare it for analysis, and Section 8
covers further data management operations with variables.
6. Exploration
Exploring quantitative data requires either assembling your own data (an option not
covered in this course) or locating some pre-assembled datasets online. The diffusion of
quantitative data has made tremendous progress in the past decade, and an amazing
range of often underused datasets are available online.
The socio-political and technological determinants of the current ‘data deluge’, as an
article in The Economist once put it, are outside of the scope of this guide, but a lot of
online commentary and analysis exists on the ‘data revolution’ in science, journalism
and government (check Victoria Stodden’s work first).
6.1. Access
The course makes extensive use of a few recommended datasets that were selected based
on several criteria ranging from topical interest to simplicity and quality. For your own
project, you will first be invited to work with recent versions of the European Social
Survey (ESS) and Quality of Government (QOG) data.
If you plan to go beyond these recommended sources, turn to the course material to
learn more on data repositories and data libraries. High-quality data is still rare, and
good sources to look for such data are the ICPSR and CESSDA repositories, listed on
the course website.
Important aspects of data retrieval include the following:
Always download the documentation for your data. Professional-quality da-
tasets come with extensive codebooks that help with understanding the data
structure, as well as with other notes on the data itself.
Never rely on any source to preserve the data for you. Even if the integrity of
data repositories is improving, always keep a pristine (intact) copy of the (origi-
nal) datasets that you use in your personal archives.
Full acknowledgment of the source is an ethical counterpart. In order to make
a legitimate use of datasets for either research or teaching purposes, reference
the source in full and follow all related instructions.
6.2. Browsing
The simplest way to quickly explore your data is to open the Data Editor after you
loaded your dataset: type in browse (or edit if you plan to modify the data) in the
Command window, and the Data Editor window will open. Alternatively, use the Ctrl-8
(Windows) or Cmd-8 (Macintosh) keyboard shortcut.
The variables contained in Stata datasets can be explored with the codebook command.
That same command will also return information about the dataset itself, as will the
notes command if the dataset comes with Stata data notes. It is not, however, very
practical to explore data with these tools.
Instead, use the following commands to start exploring your data:
Always start with the describe command, which can be abbreviated to its sin-
gle-letter shorthand, ‘d’. Simply typing d into Stata will return the list of varia-
bles, and typing d followed by a list of one or many variable names will describe
only these. High-quality datasets should always come with intelligent variable
names and labels.
When the list of variables is too long to be inspected in full, use the lookfor
command to search for keywords in the names and labels of variables. The
keywords need not be complete terms: ‘immig’ will work fine, for instance. You
can use any number of keywords with lookfor: be aware of synonyms and try
different possibilities.
Finally, learn how to use the rename command (shorthand: ren) as soon as pos-
sible. When you identify a variable of interest, there is a fair probability that its
name will be some kind of strange acronym or something even less comprehen-
sible, like ‘v241’ or ‘s009’. Renaming the variable will help solve that issue.
In some situations, you might also want to use the sort and order commands, respec-
tively to sort the observations according to the values of one particular variable, or to
reorder the variables in the dataset. Turn to the Stata help pages for the full documen-
tation of these commands.
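As a minimal, hedged sketch (the variable names are ESS-style assumptions):

* Sort observations by country, then by age within each country.
sort cntry agea
* Move the identifier and a few key variables to the front of the dataset.
order idno cntry gndr agea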
Application 6a. Locating and renaming variables Data: ESS 2008
‘Microdata’ is a term that generally refers to data based on individual respondents. At
that level of analysis, common demographic and socioeconomic variables include age,
gender, income and education.
We used the lookfor command to identify variables that refer to income and earnings,
which we did not type in full to allow for all ‘earn-‘ terms to show up in the results:
. lookfor income earn

              storage  display    value
variable name   type   format     label      variable label
-------------------------------------------------------------------------------
gincdif         byte   %1.0f      gincdif    Government should reduce
                                               differences in income levels
hincsrc         byte   %2.0f      hincsrc    Main source of household income
hincsrca        byte   %2.0f      hincsrca   Main source of household income
hinctnta        byte   %2.0f      hinctnta   Household's total net income, all
                                               sources
hincfel         byte   %1.0f      hincfel    Feeling about household's income
                                               nowadays

Once the variable of interest has been identified, we use the rename (shorthand ren)
command to give it a more explicit name. When successfully run, the command does
not send back any output:

. ren hinctnta income

After doing the same for other variables and writing all ren commands to our do-file,
we obtain a list of variables that can be described as follows:

. d age gender edu income

              storage  display    value
variable name   type   format     label      variable label
-------------------------------------------------------------------------------
age             int    %3.0f      agea       Age of respondent, calculated
gender          byte   %1.0f      gndr       Gender
edu             byte   %2.0f      eduyrs     Years of full-time education
                                               completed
income          byte   %2.0f      hinctnta   Household's total net income, all
                                               sources

6.3. Selections

At several points of your analysis, you will want to apply commands to selected parts of
your data, as in the case where you might want to summarize a variable only for a se-
lected category of subjects in a survey. In that case, you will be using the if conditional
statement.

The if statement works by adding a specific condition written with mathematical signs
to indicate equality or inequality, summarised below:

==      equal to                        !=      not equal to
>       greater than                    <       less than
>=      greater than or equal to        <=      less than or equal to
mi()    missing                         !mi()   not missing
Conditions can be combined with each other by using two logical operators:

&       and
|       or
The use of conditions is pretty intuitive, except for more elaborate patterns that use
brackets to create sophisticated conditions that we should not need for this course. It is
important to be able to use conditions, since they apply to almost all operations that we
will use for analysis.
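For the record, a hedged sketch of such a bracketed condition (using the variables
renamed in Application 6a):

* Summarise education separately for two age brackets selected at once.
su edu if (age >= 18 & age < 25) | (age >= 65 & !mi(age))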
Application 6b. Counting observations    Data: ESS 2008

The count command simply counts observations in a dataset, based on a given condi-
tion. Without any condition, it just counts all observations:

. count
  51142

If we are interested in knowing how many observations the dataset includes for re-
spondents strictly over 64 years old, we type the following, which uses the renamed
variables from Application 6a:

. count if age > 64
  10949

A slight issue here is that Stata counts missing values encoded as “.” as positive infinity,
which means that the above command included the missing values of the variable in its
count of observations over value 64 for the “age” variable.

The following command recounts respondents strictly over 64 years old without these,
by combining two conditions with the conjunctive & (“and”) operator:

. count if age > 64 & !mi(age)
  10803

The last command literally translates as: count observations for which the age variable
takes a value strictly over 64 and is non-missing. Missing values often appear in differ-
ent forms than “.”, which is another issue that we will learn to solve in Section 8.4.

Application 6c. Selecting observations    Data: ESS 2008

Turning to gender and education, we want to look at the average schooling years of
young females. We start with the list command to show a fraction of the data, and
then compute summary statistics with the su command to learn the basics about the
distribution of the edu variable:
. list age gender edu in 1/10, clean

       age   gender   edu
  1.    36        M    18
  2.    26        F    15
  3.    69        M    18
  4.    77        F    15
  5.    27        M    13
  6.    32        F    12
  7.    19        F    13
  8.    28        F    17
  9.    49        M    16
 10.    57        M    16

. su edu

    Variable        Obs        Mean    Std. Dev.   Min   Max
         edu      50682    11.96253     4.225673     0    50

The list command used with the in qualifier is purely exploratory: it allows you to take
a glance at a few lines of data in the same way as browse or edit would let you do from
the Stata Data Browser/Editor window. The clean option is just a cosmetic fix.

If we are interested in the average schooling years for male and female adult respond-
ents below 65 years old, we form a conjunctive conditional statement for age and use
the bysort command to separate respondents by gender:

. bysort gender: su edu if age >= 18 & age < 65

-> gender = M

    Variable        Obs        Mean    Std. Dev.   Min   Max
         edu      17890    12.73181     3.836265     0    48

-> gender = F

    Variable        Obs        Mean    Std. Dev.   Min   Max
         edu      20461    12.62538     4.011698     0    40

-> gender = .a

    Variable        Obs        Mean    Std. Dev.   Min   Max
         edu          4       12.25          1.5    11    14
Note that we did not use the & !mi(age) conditional statement, because Stata will not
include missing values (which it treats as larger than any number) in the [18; 65) inter-
val, as it would in the (64; +∞) interval formed by the previous command.
The use of conditionals is very common at all stages of analysis. The ‘or’ logical state-
ment also becomes useful when selecting observations based on the values taken by a
categorical variable, which can be explored with the fre command:
. fre cntry, rows(9)

cntry    Country

                 Freq.    Percent    Valid      Cum.
Valid   BE        1760       3.44     3.44      3.44
        BG        2230       4.36     4.36      7.80
        CH        1819       3.56     3.56     11.36
        CY        1215       2.38     2.38     13.73
        :            :          :        :         :
        SI        1286       2.51     2.51     88.13
        SK        1810       3.54     3.54     91.67
        TR        2416       4.72     4.72     96.39
        UA        1845       3.61     3.61    100.00
        Total    51142     100.00   100.00

In the command above, we tabulated the countries (cntry) of residence of the respond-
ents to the ESS. The variable cntry is a nominal variable encoded as text: no numeric
value exists for the labels “BE” (Belgium, for which the dataset holds a total number of
observations of N = 1760 respondents), “BG” (Bulgaria, N = 2230), … “TR” (Turkey, N
= 2416) and “UA” (Ukraine, N = 1845). This means that we will need to use strings
(text) in “double quotes” to pass a command to them with Stata, as in this count of
Greek respondents:

. count if cntry=="GR"
  2072

If our analysis were to focus on the average level of male and female education in
Greece and Cyprus, we would run a command using a disjunctive | (“or”) operator to
include respondents from both countries:

. bysort gender: su edu if cntry=="GR" | cntry=="CY"

-> gender = M

    Variable        Obs        Mean    Std. Dev.   Min   Max
         edu       1544    11.83679     3.988876     0    24

-> gender = F

    Variable        Obs        Mean    Std. Dev.   Min   Max
         edu       1723    11.37725     3.925022     0    24

-> gender = .a

    Variable        Obs        Mean    Std. Dev.   Min   Max
         edu          0

Of course, if the whole analysis is focused on Greece and Cyprus, we would first think
of subsetting the data to these countries only. This matter is covered in Section 7, along
with other instructions about dataset manipulation. Issues to do with variable coding,
such as encoding string variables or manipulating labels, are covered in Section 8. Final-
ly, the su and fre commands are explained again in detail when covering distributions
and frequencies in Section 9.
7. Datasets
Stata datasets are characterised both by the DTA dataset file format and by the way
that the data are arranged within the file:
File format. Your dataset should come as a single file in Stata .dta format. If
your data come in any other format, you will have to convert it (Section 7.1). If
your data come in more than one file, you will need to merge all components
into one file (Section 7.2).
Many issues can appear during dataset conversion, such as text conversion er-
rors with accents or other characters, or mismatches in merged data; these is-
sues will require manual fixing or using advanced editing techniques that are
beyond the scope of this course.
Data format. The rows of your dataset should hold your units of observation
(most commonly individuals or states) and its columns should hold
your variables (such as sex or country name). To quickly check that structure,
open the Data Editor by typing browse.
If your data are formatted as time series, with variables in rows and values for
each time unit (such as years) in columns, you will need to use the reshape
command (Section 7.3). This often happens with country-level data measured
over several years.
Finally, for the purposes of this course, you are required to work on cross-
sectional data that were collected at only one point in time. If your dataset con-
tains time series or any form of longitudinal study, you will need to subset it to a
single time period (Section 7.4).
Important: data management is time-consuming, error-prone and complex. All the op-
erations described in this section, with the exception of subsetting, will draw a lot of en-
ergy from you. If your dataset for this course cannot be made ready within a short
amount of time (around two full days of work), do not engage in longer operations that
might eventually fail and leave you without usable data.
If you get stuck, start by checking the UCLA Stat Computing advice page for guidance:
http://www.ats.ucla.edu/stat/stata/topics/data_management.htm.
7.1. Conversion
Most simply, some datasets come in compressed archives like ZIP files, which you will
need to decompress while making sure that no error occurred during decompression.
Free decompression software exists for all operating systems.
Do not try to use ASCII data for which you need to use the infix and dictionary com-
mands, which are too time-consuming for the purpose of this course. Ask us for help if
you really need to bypass this recommendation.
If your files come in SPSS or SAS format, or in any other format for use in another sta-
tistical package, you will need to use a conversion utility to convert the data. We should
be able to use Stat/Transfer to help with that process.
Check the Stata FAQ from UCLA Stat Computing for guidance on dataset conversion:
http://www.ats.ucla.edu/stat/stata/faq/default.htm. Again, several encoding issues can
occur during dataset conversion, and you will be required to perform a thorough check
of the result to clear any possible mistake.
If your data come in a format supported by Microsoft Excel, you should export your
data to CSV format and import it into Stata with the insheet command; see
http://dss.princeton.edu/online_help/stats_packages/stata/excel2stata.htm.
Briefly put, your data should all fit on one single Excel spreadsheet and contain nothing
other than the data, except for the header row on the first line of your file. The header
must contain the variable names, which should have short names and must not start
with an underscore (_) or a number.
Your numeric data must not contain formulae and must be formatted as plain num-
bers: do not use any other format, as it might cause issues when importing into Stata.
Furthermore, the numeric values should not use commas (,) as they can disrupt the CSV
format.
Finally, make sure that the missing observations are represented by blank cells in your
data. To do this, you must find and replace all characters that are often used to mark
missing observations (such as “NA”) with either blank space or the standard “.” symbol
for missing values in Stata.
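Putting the Excel route together, a minimal sketch (the file name is hypothetical):

* Import a comma-separated file exported from Excel; variable names are read
* from the header row, and blank cells become missing values.
insheet using "mydata.csv", comma clear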
7.2. Merging
You can merge your data in either Stata or Microsoft Excel. Units of analysis are natu-
rally expected to be identical in both datasets.
It is essential that your observations match identically when merging files: for in-
stance, when merging two datasets with country-level data, you will have to make sure
that the countries are present in both datasets under identical names.
Merging Stata datasets uses the Stata merge command, a very powerful tool for merg-
ing and matching your data. The command is very well documented in this handy tuto-
rial by Roy Mill:
http://stataproject.blogspot.com/2007/12/combine-multiple-datasets-into-one.html
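As a hedged sketch of a country-level merge (file and variable names are hypothetical;
the merge 1:1 syntax requires Stata 11 or later):

* Match two country-level files on an identically coded country variable.
use "countries_gdp.dta", clear
merge 1:1 country using "countries_health.dta"
tab _merge   // check how many observations matched across the two files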
7.3. Reshaping
Your dataset should hold your units of observation in rows, and your variables in col-
umns. If that format is not respected, you will need to reshape your dataset in order to
fit that format.
In this example, the data are format-
ted with year values in columns,
while the units of observation are
displayed in rows.
This format often applies to time se-
ries for country-level data. For exam-
ple, this format applies to OECD da-
ta, as shown here. The data were
provided for Microsoft Excel.
Solving this issue requires running a
series of steps called “reshaping”.
To reshape data for one variable, follow the following steps carefully:
Start by making sure that your data have been properly prepared: all variables
must be numeric, and missing observations should be encoded as such.
Prepare your data as a CSV file. All variables should be labelled on the first line,
and the rest of the file should contain only data (remove any other text or in-
formation).
Add a letter in front of each year. Select your first line, which contains the vari-
able names, and then use the Edit > Replace… menu item in Excel to add a ‘y’
in front of each year. For instance, if your data were collected for years 1960-
2010, find ‘19’ and replace by ‘y19’, and find ‘20’ and replace by ‘y20’.
Import using the insheet command, and check your data in the Stata data edi-
tor. The example below shows the result with only one variable (health ex-
penditure per capita in some OECD countries).
Create a unique id value for each unit of observation (in this case, OECD coun-
tries) by typing gen id = _n and then order id. This will add an additional varia-
ble to your data.
To reshape your data, type reshape long y, i(id) j(year) to reshape your data
columns that start with a ‘y’, for all rows identified by the id variable, into a dif-
ferent data format, called “long”, where the years will have been fit into a year
variable.
. reshape long y, i(id) j(year)
(note: j = 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                       16   ->   256
Number of variables                  18   ->   4
j variable (16 values)                    ->   year
xij variables:
                  y1960 y1961 ... y1975   ->   y
-----------------------------------------------------------------------------

Your dataset will have been converted from its initial “wide” format, with values for
each year in columns, to a “long” format where the values for each year appear on
separate rows.

Once you are in “long” mode, you can rename the variable that you were working on
and drop the observations that you do not need (remember that you are working to
obtain cross-sectional data and not time series):

ren y hexp
la var hexp "Health expenditure per capita"
drop if year != 1975
The screenshot on the left shows your data in “long” mode; the screenshot on the right
shows the same data after executing the commands described above.
If you are trying to reshape a dataset that is formatted in “wide” mode with more than
one variable, more steps are required to separate the variables, as described in this tu-
torial: http://dss.princeton.edu/training/DataPrep101.pdf (locate the “Reshape” slides,
#35).
Note that data reshaping will take a lot of time to achieve if you are merging and re-
shaping data over a large number of files. Do not try to merge more than a handful of
datasets, as any other operation would require more time than this course can reasona-
bly require from you.
Additional options in the reshape command allow you to reshape data where the suffix is
not numeric: type help reshape (or the shorthand version, h reshape) for additional
documentation about this step.
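For instance, a hedged sketch with a non-numeric suffix (the variable names are
hypothetical):

* Variables named gdp_fr, gdp_de, ... reshape into a string j() variable.
reshape long gdp_, i(id) j(iso) string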
7.4. Subsetting
Subsetting your data is a way to analyse only a selected subsample of the data. This can
happen principally for two reasons:
This course focuses on cross-sectional data and therefore requires that you
subset only one time period if your dataset spans over more than one period of
survey data.
If you want to analyse only one segment of the data, such as only one country
in a Europe-wide dataset or one age group in a population-wide dataset, you
should subset the data to it.
Application 7a. Subsetting to cross-sectional data Data: NHIS 2009
In order to calculate and analyse the Body Mass Index (BMI) of American respondents,
we use recent data from the National Health Interview Survey (NHIS). We exam-
ine the structure of the dataset by inspecting the year variable with the fre command
(with trivial formatting options):
. fre year, rows(5) nol

year

                 Freq.    Percent    Valid      Cum.
Valid   2000     28712      11.41    11.41     11.41
        2001     29459      11.71    11.71     23.12
        :            :          :        :         :
        2008     18913       7.52     7.52     90.34
        2009     24291       9.66     9.66    100.00
        Total   251589     100.00   100.00

At that stage, we need to select which year we want to work on. An intuitive choice is
the most recent year, if it holds a sufficient number of observations. In this example,
survey year 2009 forms a large subsample of observations, though not the largest.

Given that our dependent variable, the Body Mass Index, will require the height and
weight of each respondent to be calculated, we verify the total number of observations
for the height and weight variables among respondents who were interviewed during
survey year 2009:

. su height weight if year==2009

    Variable        Obs        Mean    Std. Dev.   Min   Max
      height      24291    66.61652     3.865753    59    76
      weight      24291    172.5895     37.12779   100   285

The results indicate that survey year 2009 holds a sufficiently large number of observa-
tions, and that the variables of interest are not missing for that year. We thus subset
the dataset to that year with the keep command and the if operator set to select obser-
vations where the year is equal (“==”) to 2009:

. keep if year==2009
(227298 observations deleted)

We could also have used the drop command to suppress all years that are different
(“!=”) from 2009, although that formulation is somewhat less intuitive:

. drop if year!=2009
(227298 observations deleted)
In this example, the total number of observations which we study in our analysis is thus
N = 24,291, rather than 251,589 for all survey years. The keep and drop commands
also apply to variables, and we could continue here by subsetting the dataset to only a
handful of variables which we plan to use in the analysis, but this course will not require
you to do so (a sketch of that optional step follows below).
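As a minimal, hedged sketch (the selection of variables is our own, based on the
variables used elsewhere in this guide):

* Keep only the survey year and the variables used in later examples.
keep year sex age height weight health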
An additional check of the year variable shows that subsetting was successful:

. fre year, nol

year

                 Freq.    Percent    Valid      Cum.
Valid   2009     24291     100.00   100.00    100.00
8. Variables
The basic anatomy of a variable consists of its name and values, to which we can add
labels in order to provide short descriptions of what the variable measures. Data, and
categorical data especially, are rarely understandable without labels.
Your primary source of information for variable codes is always the dataset codebook,
but for practical reasons, some of that information is also stored in your dataset, as you
might have to modify it before running your statistical analysis.
This chapter shows:
How to inspect variables (Section 8.1). Variable inspection is necessary to learn
how the variable is coded, in order to select appropriate commands for its ma-
nipulation and analysis.
How to show and set variable labels (Section 8.2). Labels are short text descrip-
tions attached to your variables and to their numeric values. They increase the
readability of your dataset, especially for categorical data.
How to recode variables into different categories (Section 8.3). Recoding al-
lows you to select the categories in which to manipulate your variables, which is
helpful when you want to analyse particular groups of observations.
How to solve encoding issues (Section 8.4). Encoding applies to missing data, which should be coded as “.”, and to variables that hold text, i.e. “strings”, which are better manipulated when they are encoded with numeric values.
8.1. Inspection
The following commands allow inspecting the names, values and labels of variables:
codebook is the most exhaustive command if you need to understand your data
structure in depth.
Stata also offers a note function that makes it possible to write an annotated codebook within a Stata dataset (a short sketch follows this list), but this function is idiosyncratic to the DTA format and limits interoperability.
describe (shorthand d) is generally used to describe several variables at once, as
when opening a dataset for the first time. It provides three kinds of information:
The variable name is how the variable is named in your dataset. It is the
name that you pass to Stata through your commands.
The variable label is a short text description of the variable (e.g. “Sex”). It usu-
ally includes the unit of measurement used by the variable when relevant.
The value label is the name of a distinct element of data structure that as-
signs text labels to the numeric values of the variable. It will often be the
case that value labels will have the same name as your variable.
label list (shorthand la li) is one of many label commands to show and edit the
labels featured in a dataset.
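The note function mentioned above works along the following lines; a minimal sketch, with purely illustrative note text:

. note: NHIS 2009 extract prepared for the BMI analysis.
. note sex: coded 1 = Male, 2 = Female (value label sex_lbl).
. notes list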
Application 8a. Inspecting a categorical variable Data: NHIS 2009
The example below shows the encoding of a variable that codes for respondents’ sex:
To understand how the male and female respondents are coded by the sex variable,
use the label list command (shorthand la li) to display the value label sex_lbl:
The sex_lbl value label is a separate entity from the sex variable itself: it can be applied
to any other variable where it is suitable to have males coded as 1 and females as 2, as
with variables that code for the gender of other persons in the respondent’s household.
When you need to access all information above in a single command, the codebook
command provides detailed output on names, values and labels, as well as more details
on missing data:
. d sex

              storage   display    value
variable name   type    format     label      variable label
sex             byte    %8.0g      sex_lbl    Sex

. la li sex_lbl
sex_lbl:
           1 Male
           2 Female

. codebook sex

sex                                                                        Sex
                  type:  numeric (byte)
                 label:  sex_lbl
                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/24291
            tabulation:  Freq.   Numeric  Label
                         10978         1  Male
                         13313         2  Female
We will later explore a smarter way to code this variable, as a dummy. On its own, the mean value of the sex variable is unreadable, but if we code sex as 0 for males and 1 for females, and make that coding explicit by naming the variable female, then its mean directly shows the proportion of women in the sample.
Application 8b. Inspecting a continuous variable Data: NHIS 2009
For continuous data, the d and codebook commands can show the same information as
for categorical data. The codebook command, used with the compact (shorthand c) option, displays the variable name and label along with its total number of observations, mean
and range. The example below shows its results for age, height, weight and health:
This table presents descriptive statistics in an efficient way that resembles the table of
summary statistics that we will produce at the end of Section 9. Note however that the
health variable is not a truly continuous variable but an ordinal one that codes the self-
reported health of respondents on a subjective scale from 1 (excellent) to 5 (poor).
8.2. Labels
Section 6.2 already introduced the rename (shorthand ren) command to modify variable names. We now turn to modifying variable and value labels:
All your variables should be assigned at least one label, the variable label, which
is already set in most datasets.
When creating variables, label them with the label variable (shorthand la var)
command. Include, if applicable, their unit of measurement.
A second form of label then applies to the values of categorical data, as when 1
codes for “Strongly Agree”, 2 codes for “Agree” and so on.
These labels are modifiable with the label define (shorthand la def) and label
values (shorthand la val) commands.
Application 8c. Labelling a dummy variable Data: NHIS 2009
In this example, we create a variable for the respondents’ Body Mass Index (BMI) and
examine it under three different forms. We first create the variable from the respond-
ents’ weight and height measurements, respectively in pounds and inches:
. codebook age height weight health, c

Variable     Obs   Unique       Mean   Min   Max   Label
age        24291       67   46.81392    18    84   Age
height     24291       18   66.61652    59    76   Height in inches without shoes
weight     24291      186   172.5895   100   285   Weight in pounds without clothes...
health     24284        5   2.288709     1     5   Health status
Immediately after creating the variable and checking its results, we label the variable
with the signification of the ‘BMI’ acronym to help ourselves and others make sense of
the data at later stages of analysis:
Given that Body Mass Index is a continuous measurement that comes in its own metric,
the bmi variable does not require any additional label. Let’s assume, however, that we
are further interested in identifying respondents with a BMI of 30+, which designates
obesity in the WHO classification of BMI. To that end, we create a dummy variable for
respondents over that threshold:
The gen command created the obese variable and assigned the obese value label to it.
The logical test (bmi >= 30) returned 1 when that statement was true, 0 if false. Obser-
vations for which the bmi variable was missing were excluded from the operation and
therefore preserved as missing.
The result is a dichotomous variable where 1 codes for obesity and 0 otherwise. When
we summarize the obese dummy as a continuous variable with the su command, its
mean provides the percentage of obese respondents in the sample:
In the next commands, we label the obese variable and define the obese value label in
reference to its two possible values (1 if obese, 0 if not):
. gen bmi=weight*703/height^2

. su bmi

    Variable        Obs      Mean    Std. Dev.        Min        Max
         bmi      24291     27.27    5.134197    15.20329   50.48837

. la var bmi "Body Mass Index"

. d bmi

              storage   display    value
variable name   type    format     label      variable label
bmi             float   %9.0g                 Body Mass Index

. * Dummy for obesity.
. gen obese:obese = (bmi >= 30) if !mi(bmi)

. su obese

    Variable        Obs       Mean    Std. Dev.   Min   Max
       obese      24291   .2626076      .44006      0     1
The variable and value labels show in the categorical display of the dummy through the
fre command. The variable is fully specified:
The obese value label can also be assigned to other variables with the la val command,
if you are interested in coding for obesity in other persons than the respondent.
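For instance, assuming a second BMI measure existed for the respondent’s spouse (the bmi_sp and obese_sp names below are hypothetical), the same value label could be reused as follows:

. gen obese_sp = (bmi_sp >= 30) if !mi(bmi_sp)
. la var obese_sp "Obesity of spouse (BMI 30+)"
. la val obese_sp obese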
8.3. Recoding
Recoding is a way of producing a new variable out of an existing one, by collapsing
values of the original variable into different categories. Most of the variables in your
dataset are probably ready for use in their original metric, but in some cases, you might
want to recode your variable using one of the following techniques:
The recode command is very handy to create groups from continuous data or to
permute values in categorical data, as shown in Application 8d. It can also cre-
ate dummies, as shown in Application 8e along with the tab, gen() equivalent.
The gen command has three extensions recode(), irecode() and autocode()
that basically produce the same result as the recode command in less code and
with a bit more flexibility, as shown in Application 8f.
The replace command can be used to ‘hard-recode’ variables, but you would be altering your original variables in doing so and therefore running an additional risk of data-related error. We use the replace command for missing data only.
When you are creating groups from continuous variables, make sure that your catego-
ries do not omit any values from the original variable (exhaustiveness), and that they
do not overlap (mutual exclusiveness). For example, if you plan to recode educational
attainment, make sure that each diploma appears in only one category, and that all lev-
els of education are represented in your new categories.
The number of categories is a substantive issue that depends on your variable. Aggre-
gation can take the form of birth cohorts, age groups or income bands. The only inviolable rules that apply are exhaustiveness and mutual exclusiveness (recode each value of the original variable to one, and only one, category of the new variable).
. la var obese "Obesity (BMI 30+)"
. la def obese 0 "Not obese" 1 "Obese"

. fre obese

obese    Obesity (BMI 30+)
                        Freq.    Percent    Valid      Cum.
Valid   0 Not obese     17912      73.74    73.74     73.74
        1 Obese          6379      26.26    26.26    100.00
        Total           24291     100.00   100.00
Application 8d. Recoding continuous data to groups Data: NHIS 2009
The age variable, which measures the age of respondents, can be used in its continuous
form or can be recoded into age groups for crosstabulations. To recode the age variable
into four age groups, we use the recode command and create the age4 variable with
the gen option, keeping all names and labels as concise and explicit as possible:
Never operate a transformation like the one above without checking its results. The
whole point of programming your analysis into a do-file is that you can include com-
ments and checks throughout your work. Here, the fre command serves as a technical
verification for the operation:
The number of age groups ultimately depends on your research design. A more fine-
grained categorisation might apply if your hypotheses predict strong generational or
cohort effects, or rely on specific positions in the life cycle, for instance being part of the generation that was young and politically active around 1968 in Western countries.
Application 8e. Recoding dummies Data: NHIS 2009
In Application 8a, we saw that the sex variable coded 1 for males and 2 for females in
our dataset. However, we prefer to manipulate dichotomous measures in the form of
dummy variables that use sensible values of 0 and 1 in relation to the variable name.
To that end, we recode the sex variable to the female dummy where 1 naturally codes
for being female and 0 for not being female, i.e. male. We could use the gen command
as we did in Application 8c, but the recode command is just as efficient here:
. * Recoding age to 4 groups.
. recode age ///
>     (18/29=1 "18-29")  ///
>     (30/44=2 "30-44")  ///
>     (45/64=3 "45-64")  ///
>     (65/max=4 "65+"), gen(age4)
(24291 differences between age and age4)

. la var age4 "Age (4 groups)"

. fre age4

age4    Age (4 groups)
                    Freq.    Percent    Valid      Cum.
Valid   1 18-29      4744      19.53    19.53     19.53
        2 30-44      6715      27.64    27.64     47.17
        3 45-64      8477      34.90    34.90     82.07
        4 65+        4355      17.93    17.93    100.00
        Total       24291     100.00   100.00
You should check for exact concordance at that point. Crosstabulating the original and
recoded variables will work, but a quicker concordance test exists here:
Dummies are very common in statistical modelling, and Stata offers more ways to code
information into dummies. The tab command, for instance, can create dummies for
each of its categories with the gen option, as here with marital status:
In this example, the marstat variable has been used to create six dummies, all named
with the married prefix, and each coding for one category of marital status. The dum-
mies are given descriptive variable labels, as shown by the d command when used on
all married1, married2, … married6 dummies at once with the * operator:
The codebook command with the c option shows the mean value for each dummy, which is also the proportion of its category in the data. The dummy for being divorced, married3, hence has a mean of .15 and represents 15% of all observations.
. * Recoding sex as a female dummy.
. recode sex (1=0 "Male") (2=1 "Female") (else=.), gen(female)
(24291 differences between sex and female)

. count if female != sex-1
  0

. * Recoding marital status as dummies.
. tab marstat, gen(married)

  Legal marital status        Freq.     Percent      Cum.
               Married       11,221       46.19     46.19
               Widowed        1,874        7.71     53.91
              Divorced        3,696       15.22     69.12
             Separated          906        3.73     72.85
         Never married        6,542       26.93     99.79
Unknown marital status           52        0.21    100.00
                 Total       24,291      100.00

. codebook married*, c

Variable      Obs   Unique       Mean   Min   Max   Label
married1    24291        2   .4619406     0     1   marstat==Married
married2    24291        2   .0771479     0     1   marstat==Widowed
married3    24291        2   .1521551     0     1   marstat==Divorced
married4    24291        2   .0372978     0     1   marstat==Separated
married5    24291        2   .2693179     0     1   marstat==Never married
married6    24291        2   .0021407     0     1   marstat==Unknown marital status
Application 8f. Recoding bands Data: NHIS 2009
If you need to produce more complex recodes, the recode(), irecode() and autocode() extensions of the gen command produce results similar to recode in less code and in a more flexible way that is particularly appropriate for recoding continuous data into bands, as with age groups or income bands.
The following example uses irecode() to recode Body Mass Index as the four groups
established by its international classification:
This command creates a first category of respondents for which 0 ≤ BMI < 18.5, which is classified as underweight, up to a fourth category for BMI ≥ 30, which designates obese respondents. We add variable and value labels to specify the recoded variable,
and finally proceed in checking the recoded variable with the table command:
The table command is used here as a technical check comparing the categories of the
bmi4 variable to the minimum and maximum values of BMI that they respectively hold.
The format (shorthand f) option limits the number of visible floating digits.
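For comparison, the autocode() extension mentioned at the start of this application cuts a continuous variable into a given number of equal-width bands; a minimal sketch, with illustrative bounds and variable name:

. gen bmi4b = autocode(bmi, 4, 15, 55)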
8.4. Encoding
Encoding issues have to do with the format of your data. Fundamentally, your dataset
is just a text file with a text encoding and delimiters for columns and rows. Within that file, additional encoding conventions apply to your data:
Missing data are often encoded as arbitrary numeric values. These values can
be distinctive, such as -1 for strictly positive data or 9 for ordinal data on a five-
point scale. In other cases, multiple codes are used, as in 77 for “Refused to an-
swer”, 88 for “Do not know” (DNK) and 99 for “No answer” (NA).
. * Recoding BMI to 4 groups.
. gen bmi4:bmi4 = irecode(bmi, 0, 18.5, 25, 30, .)

. la def bmi4 1 "Underweight" 2 "Normal" 3 "Overweight" 4 "Obese"
. la var bmi4 "BMI classes"

. table bmi4, c(freq min bmi max bmi) f(%9.4g)

 BMI classes    Freq.   min(bmi)   max(bmi)
 Underweight      274       15.2      18.48
      Normal     8625      18.51         25
  Overweight     9013      25.01      29.99
       Obese     6379      30.02      50.49
Stata requires missing data to be encoded as a dot (.). It also supports multiple missing data formats: .a, .b, …, .z can be used to encode missing data of different kinds. Stata treats all missing data as +∞ (positive infinity), i.e. larger than any valid value.
Missing data that are not yet coded in Stata format can be addressed with the
replace command. When encoding several variables with identical coding
schemes, the mvdecode command can perform batch encodings.
Textual data are often encoded as chains of characters. These “strings” of text,
as they are called in programming environments, are difficult to manipulate
from Stata because they do not come with a numeric framework.
Stata requires strings to be encoded with numeric values. Only in specific cir-
cumstances is encoding text neither necessary nor particularly desirable, as
when manipulating singular information like country names.
Text data that are not yet supported by numeric values can be addressed with
the encode command. In the specific case of numbers encoded as text, the
destring command is used instead.
Encoding missing data Data: NHIS 2009
The following example shows a typical encoding issue. If analysed in its current state,
the diayrsago variable will treat values 96, 97 and 99 as valid measurements, therefore
distorting completely any analysis of the variable:
To solve this issue, we need to replace values 96, 97 and 99 with missing data codes
that are recognisable by Stata, i.e. either just “.” for all values or .a, .b and .c for each
of them if we are interested in keeping them distinct from each other. The precise
choice entirely has to do with our research design.
. fre diayrsago, row(10)

diayrsago    Years since first diagnosed with diabetes
                                     Freq.    Percent    Valid      Cum.
Valid    0 Within past year             86       0.35     0.35      0.35
         1 1 year                      151       0.62     0.62      0.98
         2 2 years                     163       0.67     0.67      1.65
         3 3 years                     164       0.68     0.68      2.32
         4 4 years                     117       0.48     0.48      2.80
         :                               :          :        :         :
        81 81 years                      1       0.00     0.00      8.85
        82 82 years                      1       0.00     0.00      8.85
        96 NIU                       22111      91.03    91.03     99.88
        97 Unknown-refused               2       0.01     0.01     99.88
        99 Unknown-don't know           28       0.12     0.12    100.00
         Total                       24291     100.00   100.00
Assuming that we want to fix the issue in the simplest way, two solutions apply. The
first solution modifies the diayrsago variable directly, using the replace command to
substitute values over 95 with missing data:
The alternative code uses the gen command with the cond() operator to create a new
variable through a simple “if… else” statement. The diabetes variable will be equal to
the original diayrsago variable except when it is superior to 95, in which case it will re-
place it with missing data:
Both solutions are almost equivalent, and users might generally prefer the first one for
its simplicity. The second is actually more secure, since it does not overwrite the original
variable; however, it creates a new variable and therefore loses the original labels.
Assuming that we want to preserve a distinction between types of missing data, two
other solutions apply. The first one proceeds as before with the replace command, but
uses the .a, .b and .c missing data markers:
The alternative code is, again, more secure and this time also much quicker. It uses the
recode command to create the diabetes variable, recoding values to missing data in the
process while leaving untouched all other values by default:
Finally, let’s introduce a case where multiple variables are using the same scheme for
missing data, as with the ybarcare and uninsured variables below:
. * Simple encoding.
. replace diayrsago=. if diayrsago > 95
(22141 real changes made, 22141 to missing)

. * Alternative.
. gen diabetes = cond(diayrsago < 95, diayrsago, .)
(22141 missing values generated)

. * Detailed encoding.
. replace diayrsago=.a if diayrsago == 96
(22111 real changes made, 22111 to missing)

. replace diayrsago=.b if diayrsago == 97
(2 real changes made, 2 to missing)

. replace diayrsago=.c if diayrsago == 99
(28 real changes made, 28 to missing)

. * Alternative.
. recode diayrsago (96=.a) (97=.b) (99=.c), gen(diabetes)
(22141 differences between diayrsago and diabetes)
In this case, the mvencode and mvdecode commands are quicker than the alternatives. Correctly encoding missing data requires using the mvdecode command on both variables while passing the values to be encoded as missing through the mv() option:
Data structures can differ markedly, and encoding issues will frequently arise as soon as
you start opening datasets created in other software than Stata. Different encodings for
missing data can be solved quickly, but only if diagnosed: always spend enough time
inspecting your data to learn enough about them.
Application 8g. Encoding strings Data: MFSS 2006
In this example, we look at the Music File Sharing Study, which the Canadian govern-
ment contracted in 2006 to study how digital content affects consumer behaviour. The
survey was documented and analysed in a paper by Birgitte Andersen and Marion
Frenz (Journal of Evolutionary Economics, 2010), and is available from the Industry
Canada website: http://www.ic.gc.ca/eic/site/ic1.nsf/eng/01464.html.
The dataset was imported into Stata using the insheet command, but substantial en-
coding issues plague the data at that stage. These issues can be diagnosed by inspect-
ing the storage type of each variable, but they are more easily evident when browsing
data from the Data Editor, which you can open with the browse command:
. fre ybarcare uninsured

ybarcare    Needed but couldn't afford medical care, past 12 months
                                   Freq.    Percent    Valid      Cum.
Valid   1 No                       21811      89.79    89.79     89.79
        2 Yes                       2477      10.20    10.20     99.99
        9 Unknown-don't know           3       0.01     0.01    100.00
        Total                      24291     100.00   100.00

uninsured    Health Insurance coverage status
                                   Freq.    Percent    Valid      Cum.
Valid   1 Not covered               4510      18.57    18.57     18.57
        2 Covered                  19727      81.21    81.21     99.78
        9 Unknown-don't know          54       0.22     0.22    100.00
        Total                      24291     100.00   100.00

. * Batch encoding.
. mvdecode ybarcare uninsured, mv(9)
      ybarcare: 3 missing values generated
     uninsured: 54 missing values generated

. browse id quest s_dat prov qregn qd8 q2_1a in 1149/1159
In this screenshot, variables with values in red are simply coded as text, with no numeric value to designate them, a format also known as ‘string’, which is impractical for statistical analysis, as hinted by the warning colour that Stata assigns to their columns.
Let’s start with the prov variable, which codes for the respondent’s province of residence. Because of the string format, we have to include double quotes around its values to designate respondents from, for example, Alberta and British Columbia:
This quickly becomes impractical, so we use the encode command to produce a similar
variable with automatically generated numeric values and labels for each of them:
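A minimal sketch of that step, assuming the numeric copy is named province as in the count command shown below:

. encode prov, gen(province)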
The numeric encoding makes it possible to select respondents in Alberta or British Co-
lumbia with shorter and more flexible commands, using the values assigned by the
encode command to each category:
. count if prov=="AB" | prov=="BC"
  293

. count if province < 3
  293
Similarly, the qd8 variable coding for gender cannot be easily manipulated in its current
form: Stata needs numeric values attached to each of its categories in order to include it
in a regression model, for example.
A simple solution consists in creating a dummy coding for females, as we previously did
in Application 8e. The data is in string format, so we need to use double quotes and
text instead of numeric values to create the appropriate conditional statement:
The encode command would produce a similar result, but dummy variables with explicit
names and codes need not feature labels, so we will settle for that simple solution.
The last example concerns the q2_1a variable, which measures the number of music CDs
that the respondent bought in 2005 for his or her personal use. The variable is stored as
a string because it includes both numbers and text, including empty cells. In that state,
the variable is virtually unusable, so we apply several transformations to it:
The first three commands code for missing data where the q2_1a variable featured text
or empty cells. Since the q2_1a variable is based on text, the arguments of the replace
commands feature double quotes. Once the variable contained only numeric or missing
values, we got rid of the string format with the destring command.
Finally, the numlabel command is a handy workaround for serious encoding issues: try
numlabel _all, add to prefix all textual labels with numeric values. This makes the tab
command more practical (a problem solved in this handbook by using the fre command
instead), and might or might not help in your situation.
. * Creating a female dummy from string values.
. gen female = (qd8 == "Female") if !mi(qd8)

. replace q2_1a=".a" if q2_1a==""
(426 real changes made)

. replace q2_1a=".b" if q2_1a=="None"
(33 real changes made)

. replace q2_1a=".c" if q2_1a=="Don't Know/Refused"
(34 real changes made)

. destring q2_1a, replace
q2_1a has all characters numeric; replaced as byte
(493 missing values generated)

*

Data management, as shown by the topics covered in Sections 5–8, is not only long, it is also complex in its more elaborate stages, and very sensitive to even the smallest mistake. We will circumvent that issue by using readymade datasets for which data management will be reduced to a minimum, but in a real research environment, quantitative data skills will often extend to data management, in addition to the other skills introduced in Section 4.
Part 2
Analysis
Statistical analysis requires learning about the statistical theory that underlies the analysis, as well as about the particular procedures that run the analysis in Stata. This step is the most knowledge-intensive aspect of the course, as it requires operating several commands while knowing what they correspond to in theory, and how to interpret their results.
Statistical analysis is a professional, scientific activity. Making mistakes while analys-
ing quantitative data is common, and several rounds of analysis are usually required to
obtain reliable results. In practice, it requires the collective effort of scientists worldwide
to work on large projects and to verify that their respective research does not contain
errors of interpretation.
This class emulates that professional setting by organising small-scale research projects
that are then submitted to the scrutiny of the course instructors. The tools used in the
analysis will be restricted to a selection of statistical tests, and to the most common
form of statistical modelling, linear regression.
At that stage, it is essential that you are familiar with working in Stata, and that you
have read the handbook chapters that document the most basic aspects of data struc-
ture, such as sampling. The following section introduces several tests, procedures and
models that will connect your data to particular interpretations, all of which you will learn as you perform them.
9. Distributions
Assessing the normality of your dependent variable is essential to your analysis because
the regression model assumes that this variable is normally distributed. This assumption,
and many others that apply to regression modelling, are systematically violated, be-
cause the normal distribution is a theoretical construct.
At that stage, you should make sure that you have understood the Central Limit Theo-
rem by reading from your handbook. Try plotting the normal distribution in Stata by
typing twoway function y=normalden(x), range(-4 4), and check that you understand
how standard deviations relate to this curve. As a side note, the course will also men-
tion other (Poisson, binomial) distributions, but you will not be working directly with
either of them.
Thankfully, linear regression is quite robust to deviations from normality in your de-
pendent variable. This basically means that your analysis retains most of its validity even
if your variables express some departure from normality. Still, you should aim at having
as normally distributed a dependent variable as possible.
Assessing normality is a two-step process that starts with visual inspections of the dis-
tribution, and then continues with formal tests of normality. Following that step, you
might try different variable transformations to see whether there exists a mathematical
way to make the distribution of your dependent variable approach normality. Finally,
you will have to think about outliers (outstanding observations in your data).
These operations are absolutely essential to your analysis, because quantitative analysis
does not magically proceed by throwing aggregate data at a statistical software solu-
tion. Instead, it relies on careful modelling that aims at fitting real-world observations
into abstract models. The ‘goodness of fit’ of your model will determine the quality of
your analysis.
9.1. Visualizations
Prior to visualizing a variable, you can learn about it with descriptive statistics. These
steps come after reading the dataset documentation, and thus presume that you have
an idea of how the variable is coded and how many observations are available for it,
even if the commands listed below also inspect these:
For continuous data, the summarize command (shorthand su) provides the
number of observations for a variable, as well as its mean, standard deviation,
minimum and maximum values.
The summarize command with the detail option will add percentiles and vari-
ance, as well skewness and kurtosis, which we will return to when assessing the
normality of the distribution (Section 9.3).
For more specific operations, the tabstat command is a more flexible tool that will provide any statistics passed with its s() option, including any of the above as well as all possible statistics listed in its documentation (see the sketch after this list).
For categorical data, the fre command will give you the best approach to the
variable by listing its frequencies while paying attention to its coding and miss-
ing values. Install it with ssc install fre.
Using the standard Stata commands, you would have to use the tab command
with the missing and nolabel options, along with the label list command, to ob-
tain similar results.
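A quick sketch of these commands applied to the Body Mass Index variable from the NHIS examples, with one possible statistics list passed to tabstat:

. su bmi
. su bmi, detail
. tabstat bmi, s(n mean sd p25 p50 p75 min max)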
Once you have learnt enough about your variable through descriptive statistics, you
should turn to visualizations:
With continuous variables, you will be using the histogram as your main tool,
complemented with the box plot in order to spot outliers (Section 9.5). Useful
central tendency descriptive statistics will be the mean and median.
With categorical variables, you will be using the categorical bar plot, although
its added value over a simple frequency table is often open to question. A useful
descriptive central tendency statistic will be the mode.
The examples below illustrate these options. We will alternate between several of the
course datasets to illustrate how descriptive statistics and visualization work with differ-
ent types of data.
Important: observations in each example are not systematically weighted, so as to keep the code as simple and demonstrative as possible, but you should apply weights when your dataset offers them. Weighting, as shown in Example 5a and Example 5b, modifies
the statistical importance of observations within the sample in relation to how it was
initially designed. Sample size might also affect your description of the data, as shown
with confidence intervals in Example 9d.
Example 9a. Visualizing continuous data Data: NHIS 2009
The dependent variable in this example is the Body Mass Index (BMI) for a large sample
of American respondents in 2009, calculated from measures of weight in pounds and
squared height in inches, labelled, and then described, using the following commands:
. gen bmi = weight*703/(height^2)

. la var bmi "Body Mass Index"

. su bmi

    Variable        Obs      Mean    Std. Dev.        Min        Max
         bmi      24291     27.27    5.134197    15.20329   50.48837
The interpretation of the statistics above will cover different things:
First, the total number of observations (Obs) is satisfying: the bmi variable is available for a large fraction of the data, in fact for 100% of observations in the dataset.
Then, the average Body Mass Index (Mean) in our sample is remarkably high, as a BMI of 27 is already considered to be “overweight” in the official categorization of the BMI measure.
The standard deviation (Std. Dev.) further qualifies the distribution by giving its spread: for instance, if the variable were normally distributed, about 95% of all observations would fall within two standard deviations of the mean, i.e. roughly between 17 and 38.
Finally, the range of values (Min and Max) indicates that the distribution is skewed, since the maximum value of 50 is further away from the mean than the minimum value of 15. A box plot will confirm this.
We then turn to visualizations, using appropriate graphs for continuous data:
The histogram (shorthand hist) command was passed with two options: normal, which
overlays a normal distribution to the histogram bars, and percent, to use percentages
instead of density on the vertical y-axis. The graph hbox command comes in two dis-
tinct words because it belongs to the graph (shorthand gr) class of commands, which
can be passed options to modify its axes, titles and so on.
From the graphical results of these commands, we observe that the bmi variable is not
normally distributed due to its disproportionate amount of right-hand-side values that
form a long right ‘tail’ in the histogram, and outliers in the box plot:
The histogram shows a distribution that is skewed to the right, and the box plot shows
that BMI values over 40 are outliers to the distribution, located over 1.5 times the in-
terquartile range (Section 9.5 deals with outliers in more detail).
. hist bmi, normal percent
(bin=43, start=15.203287, width=.82058321)

. gr hbox bmi

[Figure: histogram of Body Mass Index (percent scale, normal curve overlaid) and horizontal box plot of Body Mass Index]
A precise look at the BMI variable would also reveal that its mean and median are quite
close, indicating some extent of symmetry in the distribution despite the skewness
mentioned before. We will come back to these notions.
Example 9b. Kernel density plots Data: NHIS 2009
Histograms use bars (or “bins”) to represent a distribution. A different tool to visualize
a distribution is the kernel density plot, which also displays the density of the distribu-
tion, but uses smoothed lines instead of bars.
The left-side example below shows the commands to produce a histogram and a kernel
density plot for the distribution of Body Mass Index. The options set the width of the
histogram and kernel density, along with other options (see help histogram):
The graph on the right shows a quicker way to draw a kernel density with the kdensity
command and the normal option. In both graphs, the skewness observed in the kernel
density curve also shows in a comparison of the mean and median values:
Example 9c. Visualizing categorical data Data: ESS 2008
If you are inspecting a categorical variable, you will realise that summarizing its distribution with the statistics above makes little sense. Furthermore, the tools described above will turn out to be
either inappropriate or of very little help. Instead, you will look at proportions, and you
will need to install the additional fre and catplot packages.
The fre command is particularly useful to handle missing observations. In the example
below, we look at attitudes towards immigration into Europe from poorer countries. We can tell from the distribution of the variable that only 4.5% of observations are missing, and can also read the percentages of each response item to the question:
. hist bmi, w(2) normal kdens kdenopts(bw(2) lc(red))
(bin=18, start=15.203287, width=2)

[Figure: histogram of Body Mass Index with normal and kernel density overlays (left), and kernel density estimate of Body Mass Index with normal density overlay (right; epanechnikov kernel, bandwidth = 0.5945)]

. tabstat bmi, s(n mean median skewness)

variable         N      mean        p50   skewness
     bmi     24291     27.27   26.57845   .7207431
When visualizing categorical data, follow two recommendations:
Do not use pie charts. The human eye is not used to read polar coordinates,
which makes the vast majority of pie charts useless at best, deceitful at worst.
Produce a horizontal bar plot of the valid cases with the catplot command, but
ask yourself whether the graph brings any substantial information to the reader.
The answer is most likely negative.
The plots below show the impcntr variable as a histogram and as a categorical bar plot,
but neither visualization brings much more than a frequency table:
Frequency tables like the ones produced by the fre command can be formatted to fit
into tables with other descriptive statistics (Section 13.4).
. fre impcntr

impcntr    Allow many/few immigrants from poorer countries outside Europe
                                              Freq.    Percent    Valid      Cum.
Valid     1 Allow many to come and             5750      11.24    11.78     11.78
            live here
          2 Allow some                        16496      32.26    33.79     45.56
          3 Allow a few                       17054      33.35    34.93     80.49
          4 Allow none                         9523      18.62    19.51    100.00
          Total                               48823      95.47   100.00
Missing   .a                                    119       0.23
          .b                                   2144       4.19
          .c                                     56       0.11
          Total                                2319       4.53
Total                                         51142     100.00

. hist impcntr, percent discrete addl
(start=1, width=1)

. catplot impcntr, percent blabel(bar, format(%3.1f)) yti("")

[Figure: histogram and categorical bar plot of impcntr, "Allow many/few immigrants from poorer countries outside Europe", with percentage labels: Allow many to come and live here 11.8, Allow some 33.8, Allow a few 34.9, Allow none 19.5]
Example 9d. Survey weights and confidence intervals Data: ESS 2008
A drawback of plotting distributions without first taking a look at the underlying data
structure is that the resulting plots can hide large confidence intervals. Differences in
proportions that are based on a low number of observations come with large confi-
dence intervals that might minimise or even cancel the visual differences that we
might observe on a graph.
In the example below, the survey question from Example 9c is analysed for French
adult citizens only (study: ESS, variable: impcntr, with additional variables to select the
target group made of French adult citizens). The number of valid observations for this
target group is markedly lower than previously, with only 1884 non-missing observa-
tions:
From that question, the actual proportion of French respondents who support a harsh
anti-immigration policy is hard to determine:
As shown in the frequency table above, respondents who prefer allowing
“some” or “many” immigrants from poorer countries outside Europe form a minority of 48.94%, as calculated from the cumulative distribution of all non-
missing observations. Any politician who plans on campaigning on the issue of
immigration will be interested in the figure, to side with either the minority or
the majority of potential voters.
An important issue, however, is that the sample uses survey weights to make its
observations more representative of the actual national population, as explained
in Example 5a. Furthermore, the number of observations in the sample only al-
lows us to estimate values for the rest of the population, therefore involving
confidence intervals.
. fre impcntr if cntry=="FR" & age >= 18 & ctzcntr==1

impcntr    Allow many/few immigrants from poorer countries outside Europe
                                              Freq.    Percent    Valid      Cum.
Valid     1 Allow many to come and              151       7.80     8.01      8.01
            live here
          2 Allow some                          771      39.84    40.92     48.94
          3 Allow a few                         704      36.38    37.37     86.31
          4 Allow none                          258      13.33    13.69    100.00
          Total                                1884      97.36   100.00
Missing   .a                                     13       0.67
          .b                                     38       1.96
          Total                                  51       2.64
Total                                          1935     100.00
If we weight the data before producing the frequency table, using the [aw] prefix and
the dweight variable mentioned in Example 5a, respondents who prefer allowing
“some” or “many” immigrants actually form a majority:
Survey weights also apply with the [pw] prefix to commands like prop, which computes
the confidence interval of each category based on a normal approximation:
The confidence intervals above are large because of the limited number of valid observations. We selected the conventional 95% confidence level, and the intervals would widen even more at the 99% level.
. fre impcntr if cntry=="FR" & age >= 18 & ctzcntr==1 [aw=dweight]

impcntr    Allow many/few immigrants from poorer countries outside Europe
                                              Freq.    Percent    Valid      Cum.
Valid     1 Allow many to come and         163.0514       8.43     8.63      8.63
            live here
          2 Allow some                     795.0004      41.09    42.09     50.72
          3 Allow a few                    692.9715      35.81    36.69     87.40
          4 Allow none                     237.9166      12.30    12.60    100.00
          Total                             1888.94      97.62   100.00
Missing   .a                               11.82642       0.61
          .b                               34.23363       1.77
          Total                            46.06004       2.38
Total                                          1935     100.00

. prop impcntr if cntry=="FR" & age >= 18 & ctzcntr==1 [pw=dweight]

Proportion estimation                       Number of obs   =       1884

_prop_1: impcntr = Allow many to come and live here
_prop_2: impcntr = Allow some
_prop_3: impcntr = Allow a few
_prop_4: impcntr = Allow none

                 Proportion   Std. Err.     [95% Conf. Interval]
impcntr
     _prop_1        .086319    .0073835      .0718383    .1007997
     _prop_2       .4208712    .0124809      .3963932    .4453491
     _prop_3       .3668574    .0120768       .343172    .3905427
     _prop_4       .1259525    .0081542      .1099602    .1419447
9.2. Options
You might have noticed that many graphs produced above use graph options, often to
modify the unit, scale or title of an axis. Full-fledged books have been written on the
topic, and the most common options for this course follow:
Scales that use percentages (percent) or frequencies (frequency) are often more
useful than density or fractions in histograms. Additionally, you will sometimes
want to add labels to your histogram plots with the addlabel option, or to cat-
plot bar plots with the blabel(bar) option. Learn more from the documentation
pages for each type of graph.
You might also need to use the ytitle and xtitle options to give shorter titles to
the axes in plots produced by graph commands, or to remove the titles. The
same applies to the title of your graph, which you can set with the title option.
Additionally, you can add a short note to your graph (often a mention of the
data source) with the note option.
Finally, the xscale and yscale options allow controlling the full range of your ax-
es, along with the xlabel and ylabel options that control the spacing between
labelled ticks. These options also apply to all graph commands and are particu-
larly useful to make the values on your axes correspond to the real set of values
that your data can possibly take.
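A minimal sketch combining several of these options on the Body Mass Index histogram from the NHIS examples (titles, note and axis values are illustrative):

. hist bmi, percent addlabels ///
>     title("Body Mass Index of American respondents") ///
>     xtitle("BMI") ytitle("Percent of respondents") ///
>     note("Source: NHIS 2009") xlabel(15(5)50)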
Example 9e. Democratic satisfaction Data: ESS 2008
In this example, the main variable of interest is the assessment of democracy in the re-
spondent’s country (stfdem). The codebook indicates that answers to the question
were coded on an interval scale of 0 (“Extremely dissatisfied”) to 10 (“Extremely satis-
fied”), which means that we can read the mean of the variable as an average score of
satisfaction with democracy. We applied both design and country weights to compute
a European average score of democratic satisfaction, as explained in Example 5b:
We could try collapsing individual answers by country in order to observe, for example,
if respondents in Greece are more supportive of democracy than those in Britain. Both
countries can claim a long historical experience with democratic institutions, but Greece
went through an autocratic period in the recent period, and general levels of economic
wealth are lower in Greece than they are in Britain. Grouping observations at the coun-
try-level might thus provide some useful heuristics preliminary to our analysis.
. su stfdem [aw=dweight*pweight]

    Variable        Obs       Weight        Mean    Std. Dev.   Min   Max
      stfdem      48711   52569.9013    4.628696     2.615244     0    10
Further to grouping by country (cntr), we will separate citizens from non-citizens
(ctzcntr) to observe any gap in appreciation between both groups. At that stage, the
most straightforward graph command to plot these insights would be to use the follow-
ing, which also accounts for design weights:
Unfortunately, the default settings produce a distressingly confused result:
The vertical y-axis is unreadable because it plots the stfdem variable over 26
countries and over two groups of non-/citizens, ending in 52 lines of graph.
Since the average support scores for democracy are not properly ordered, we
will not be able to read the average scores from most to least satisfied.
Additionally, we will want to add an informative legend to the scale: the default
one, “mean of stfdem”, is not straightforward enough.
The graph above is almost entirely useless. A suitable dot plot will use several additional
options and look like the following command, which runs over more than one line, as
the “///” breaks indicate:
. gr dot stfdem [aw=dweight], over(ctzcntr) over(cntr)

[Figure: default dot plot of "Satisfaction with democracy in country" over citizenship status and country, with 26 unordered country codes (BG, UA, HU, RU, TR, PT, IE, GR, FR, EE, SI, GB, PL, CZ, SK, IL, BE, DE, ES, NL, SE, FI, CY, NO, CH, DK) on the categorical axis and a Citizens / Foreigners legend]

. gr dot stfdem [aw=dweight], over(ctzcntr) asyvars over(cntr, sort(1) des) ///
>     legend(label(1 "Citizens") label(2 "Foreigners")) ///
>     ytitle("Satisfaction with democracy in country") ///
>     scale(.8) name(stfdem, replace)
The graph reveals an interesting gradient of opinions, and sometimes large gaps (at
least at the intuitive, visual level) between the subpopulations of citizens and foreigners.
The list of options used in the graph goes as follows:
The vertical axis is called with the over() options, and the asyvars option makes sure that both citizens and foreigners, the categories of the ctzcntr variable, are plotted on the same lines.
The sort(1) des option orders the categorical (country) axis by using the descending order of the first value displayed on the graph, namely satisfaction among
citizens.
The legend and label options allow rewriting the legend of the graph as “Citizens” and “Foreigners” instead of using the “Yes/No” labels of the ctzcntr variable.
The ytitle option provides a title for the continuous axis; this is different from the name option, which stores the graph in Stata memory under stfdem, overwriting (replace) any previous graph with that name.
The scale(0.8) option serves to reduce the size of all items in the graph to 80%
of their default size, including the labels on both axes and the dots that mark
the average score for democratic satisfaction.
Other useful graph options are:
yreverse reverses the continuous y-axis (which is, in fact, horizontal), in order to
plot variables where the coding and labels are inversed, as in questions where
high approval is coded as “1” and disapproval as “4”.
yscale(log) switches the axis to logarithmic scale, in order to obtain a better vis-
ual differentiation of high values. It is better, however, to transform the variable
if you plan to use a logarithmic scale to measure it.
ylabel(1(1)4) and exclude0 modify the continuous axis by respectively setting the range from 1 to 4 with ticks every 1 point, and by excluding 0 from the axis of dot plots, when 0 is not a relevant variable value.
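A minimal sketch applying some of the options above to the democratic satisfaction dot plot (the axis title is illustrative):

. gr dot stfdem [aw=dweight], over(cntr, sort(1) des) exclude0 ///
>     ytitle("Mean satisfaction with democracy (0 to 10)")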
Tweaking the plot to obtain what can be considered a suitable visual result relies on an
admittedly long list of options. However, quality is always preferable over quantity
when it comes to graphs, and these options are useful in many settings.
You might also object that the final graph is still quite difficult to read, and you would
be right: even though recent versions of Stata come with rather powerful graphing ca-
pabilities, its poor default settings are sometimes discouraging, and some researchers
prefer to use other software at that stage.
9.3. Normality
Normality is the assessment of whether a variable follows the normal distribution.
The normal distribution is an abstract concept: it refers to a curve that can be translated
into a probabilistic situation, called a probability density function. This function is used
to produce estimates of values and their confidence intervals. Cut the music, we need
your full attention for a few moments.
Start with a constant, k = 5. Let’s say that k is the amount of money that you are will-
ing to donate right now to a human rights organization. Captured at this unique mo-
ment of time and of your own preferences, this number alone does not vary: k = 5. If I
try to guess it, there is a 100% chance that the answer “5” is correct: Pr(k = 5) = 1. All
other predictions of k have a probability of 0: Pr(k ≠ 5) = 0.
Release a few assumptions and let k vary in space; call it x to mark that step. We keep
our cross-sectional assumption, and will thus leave aside temporal variation in prefer-
ences about human rights organizations. Let us assume for now that we want to pre-
dict k for the whole population of, say, Russian citizens: how many is each of them will-
ing to donate to human rights organizations? The variable k can now take any value.
Statistical theory intervenes at that point: the more values k can take, and the more ob-
servations of k we have, the better we can predict it. This is applicable to coin flipping
as it is to human rights donations, and derives from the Central Limit Theorem, which
can predict the value of k among all Russian citizens from just a sample of them, even if
we do not know how many Russian citizens actually exist.
For our purposes, the population of Russian citizens is the sum of individual preferences
about human rights donations, P = { k1, k2, …, kN }. Each item kn is a value: the
amount of money one Russian citizen would be willing to donate. In this population,
there is a mean value of k. In parallel, if we draw a sample of that population, we can
calculate its mean value, and its standard deviation.
When the population is unobservable as a whole, our objective becomes to estimate a
population parameter, the true mean value of k, from sample parameters, by observing
the mean value of k in a smaller population of n respondents to a survey that was de-
signed to reach Russian citizens and measure the extent of their willingness to donate
to human rights organizations.
The amount of craft and technique that goes into survey design is immense, and the
amount of bias that can be generated at that stage is too substantial not to mention it.
Fortunately, there are thousands of well-run surveys with careful sample design, and the
stability of some results is another way to gain confidence in our ability to measure the
real world, even in its most intricate aspects.
The Central Limit Theorem tells us that the means of repeated samples follow an approximately normal distribution, which has interesting properties for building probabilities to estimate a population parameter from its sample parameter.
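To make the theorem more concrete, the following do-file sketch (purely illustrative, and unrelated to the course datasets) draws 1,000 samples of 100 observations from a skewed distribution and shows that the distribution of their means is close to normal:

set seed 1234
program define onesample, rclass
    drop _all
    set obs 100
    gen k = rchi2(3)            // a skewed, clearly non-normal variable
    su k, meanonly
    return scalar mean_k = r(mean)
end
simulate mean_k = r(mean_k), reps(1000) nodots: onesample
hist mean_k, normal             // the sample means are roughly normal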
Regarding normality, the summarize command run with the detail option gives you
several indications that serve to understand the distribution of your variable.
As far as normality goes, you should concentrate on two indicators:
Skewness is an indication of how close to being symmetrical the distribution of
your variable is. Skewness should approach 0, since the normal distribution is it-
self perfectly symmetrical.
Kurtosis is an indication of how “thick” the tails of your variable distribution are. Kurtosis should approach 3, since that is the value observed for the tails of a normal distribution.
Remember that the normal distribution is a theoretical construct: deviations from it are hence natural. You should, however, assess the extent of the deviation from normality,
to know how theoretical assumptions apply to your work.
Example 9f. Normality of the Body Mass Index Data: NHIS 2009
In the example below, continued from Example 9a, the skewness statistic of the BMI
variable deviates from 0. Its positive sign indicates that the right-hand-side of the distri-
bution is causing that deviation:
There are more complex ways to assess normality: some statistical tests apply, such as
the Shapiro-Francia test with the sfrancia command if your data is made of less than
5,000 observations of ungrouped ‘unpaired’ data. These tests, however, are ultimately less useful than graphical assessment with distributional diagnostic plots: the sym-
metry plot (symplot), which tests for symmetry, the normal quantile plot (qnorm) and
the normal probability plot (pnorm) all work towards that end.
The last two plots are shown below: what they respectively show is that the BMI varia-
ble is deviating from the normal distribution both in its central values (as shown in the
pnorm plot) and at its tails (as shown in the qnorm plot):
. su bmi, d

                       Body Mass Index
      Percentiles      Smallest
 1%     18.30729       15.20329
 5%     20.11707       15.20329
10%     21.26276       15.20329       Obs            24291
25%     23.51343        15.5041       Sum of Wgt.    24291

50%     26.57845                      Mean           27.27
                        Largest       Std. Dev.   5.134197
75%     30.22843       49.60056
90%     34.32617       50.38167       Variance    26.35998
95%     36.91451       50.48837       Skewness    .7207431
99%     41.59763       50.48837       Kurtosis    3.463278

. qnorm bmi
. pnorm bmi
9.4. Transformations
The tools that are used to find possible transformations of a variable are:
The ladder of powers (gladder) and ladder of quantiles (qladder), which pro-
vide visual guides to common variable transformations;
The ladder command, from which the best transformation can be chosen by se-
lecting the one with the lowest Chi-squared statistic in the table.
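A minimal sketch of the ladder command on the Body Mass Index variable from the NHIS examples:

. ladder bmi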
Some common transformations apply to macro-data: country population and GDP per
capita are better expressed in logarithmic units. Others apply to micro-data: the distri-
bution of age, for instance, is often better captured in squared units. Transformations
are generally theoretically informed and will matter when interpreting your data.
Example 9g. Transforming the Body Mass Index Data: NHIS 2009
Running the commands above suggests that BMI approaches normality when measured on a logarithmic scale (middle graphs):
Performing a logarithmic transformation in Stata requires using the ln() function to calculate the natural logarithm of the original BMI variable:
. gladder bmi
. qladder bmi

[Figure: "Histograms by transformation" -- gladder panels of Body Mass Index under the cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square and 1/cubic transformations (density on the vertical axis), and the corresponding qladder quantile plots]
We can now check that the skewness and kurtosis of BMI are closer to 0 and 3 than
they previously were, by displaying the histograms for both ‘raw’ and ‘transformed’
BMI side by side with the graph combine command:
A visual check is usually enough to observe whether a transformation effectively brings
a variable closer to normality, as it does here:
We can finally check for skewness and kurtosis in both variables, in order to see how
much more symmetrical the transformed variable is (skewness ≈ 0), and how its tails match the tails of the normal distribution (kurtosis ≈ 3):
In this example, both aspects of the distribution are now closer to normality: the rest of
our analysis might hence use the transformed BMI variable. Note that the transfor-
mation only affects the unit of measurement for BMI: it does not imply modifying the
actual data beyond that characteristic.
. gen logbmi = ln(bmi)

. la var logbmi "Body Mass Index (log-units)"

. hist bmi, normal ///
>     title("BMI", margin(medium)) xtitle("") name(bmi, replace)
(bin=43, start=15.203287, width=.82058321)

. hist logbmi, normal ///
>     title("log(BMI)", margin(medium)) xtitle("") name(logbmi, replace)
(bin=43, start=2.7215116, width=.02791236)

. gr combine bmi logbmi, ysize(2)

[Figure: side-by-side histograms of BMI and log-BMI with normal curves overlaid]

. tabstat bmi logbmi, c(s) s(skew kurt)

variable     skewness   kurtosis
     bmi     .7207431   3.463278
  logbmi     .2346392   2.762445
9.5. Outliers
Your data might contain outliers, such as a small number of people who earn salaries
that are very, very far above the median income, or a small number of states with ex-
cessively small populations. What to do with outliers is primarily a substantive ques-
tion that depends on your research design.
Conventionally, mild outliers are observations identified by a value located over 1.5
times the interquartile range (IQR) of the variable under examination, and extreme out-
liers by a value over 3 times the same measure. Refer to the course material for details
on how box plots are constructed.
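For reference, the fences can be computed by hand from the quartiles returned by summarize with the detail option; a minimal sketch on the BMI variable, where 1.5 times the interquartile range marks mild outliers and 3 times marks extreme outliers:

. qui su bmi, detail
. di "mild outliers above:    " r(p75) + 1.5 * (r(p75) - r(p25))
. di "extreme outliers above: " r(p75) + 3 * (r(p75) - r(p25))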
The examples below use box plots and the extremes package to detect outliers. More advanced techniques for detecting outliers, either before analysis (using graph matrixes) or during regression analysis (using leverage-versus-residual-squared plots), are beyond the scope of this course.
Example 9h. Inspecting outliers Data: NHIS 2009
When we plotted Body Mass Index in the United States, the presence of a large num-
ber of outliers was graphically observable on the right-hand side of the box plot distri-
bution (Example 9a).
We have no substantial reason to exclude outliers from this distribution, so we will just
explore them with the extremes command and the iqr(3) and N options to list and
count respondents who are extreme outliers to the BMI distribution (keep in mind that
values of BMI over 40 will usually indicate morbid obesity):
We previously found a transformation of BMI that brought the distribution close to
normality, so excluding outliers would make little sense overall. Furthermore, removing
mild (n = 421) or extreme (n = 3) outliers to the BMI distribution in a sample of N = 24,291 observations would not make a statistical difference.
Example 9i. Keeping or removing outliers Data: QOG 2011
Detecting outliers can serve a purely informative purpose, but you might also want to
consider working on a subset of your data to exclude outliers from the analysis. In the
N 3 3 3 3 3 3
22511. 3.017 50.48837 Female 63 Hispanic Very Good
17683. 3.017 50.48837 Female 28 Hispanic Good
22943. 3.001 50.38167 Female 30 Black Good
obs: iqr: bmi sex age raceb health
. extremes bmi sex age raceb health, iqr(3) N
!
86
example below, we study private health expenditure as a fraction of gross domestic
product (variable: wdi_prhe). Options passed to the graph hbox command will identify
the outliers:
. gr box wdi_prhe, mark(1, mlabel(cname))

[Figure: box plot of Private Health Expenditure (% of GDP), 0–10, with outliers labelled Nigeria, Afghanistan, Georgia, Liberia, United States and Sao Tome and Principe]

The distribution of private health expenditure shows a small number of outliers. Further
exploration of the histogram shows that the outliers create a clear deviation from
normality on the right-hand side of the distribution:

[Figure: histogram of Private Health Expenditure (% of GDP), 0–10, with density on the vertical axis]

Depending on our research design, we might want to get rid of outlier countries spending
more than, say, 5% of their GDP on private health expenditure. This would make
statistical sense, as the distribution of the variable comes closer to normality when we
apply that modification to the data:

. gen wdi_prhe2 = wdi_prhe if wdi_prhe < 5
(18 missing values generated)

. tabstat wdi_prhe wdi_prhe2, c(s) s(n mean skewness kurtosis)

    variable |    N      mean  skewness  kurtosis
   wdi_prhe  |  188  2.592303  1.337639  5.993119
   wdi_prhe2 |  176   2.33017  .2326665  2.370385

The commands above created a wdi_prhe2 variable by copying values of private health
expenditure from the wdi_prhe variable when these were below 5. The operation
excluded a few countries, and the distribution of the new variable now better satisfies
the normality criteria.

Statistically, it would make sense to stick with the wdi_prhe2 variable for the rest of the
analysis. However, excluding data points (observations) requires a substantive justification:
we would thus have to document the exceptionality of health expenditure in the
outlier countries prior to their exclusion.
10. Association
[Note: this section should be rewritten to merge the first two subsections and to treat
t-tests and correlation in a closer way, so as to reflect their parametric nature and how
they help in understanding regression afterwards. Moving correlation here makes sense,
since the subsection on controls can introduce multicollinearity with correlation and
graph matrixes. Possible structure: (1) theory, (2) crosstabs with all nonparametric tests,
(3) parametric t-test, (4) correlation, (5) controls. A brief note about ANOVA could
perhaps be added in the last section.]
In this section, you will test your variables for independence, that is, you will assess
whether you should reject or retain the null hypothesis that states an absence of associ-
ation between two variables, such as income and education or population density and
GDP.
These tests are useful to your analysis because they will suggest whether your inde-
pendent variables are suited for inclusion in your regression model. The tests will also
allow you to identify some of the interactions that might exist between your independent
variables.
At that stage, you should make sure that you understand the statistical terms that relate
to probability distributions. Remember, for example, that you have to determine the
‘alpha’ level of significance that you will be using before looking at p-values and other
elements of your tests.
You might also want to check that you understand the core logic of association. The
key problem with understanding causal inference in observational (i.e. non-
experimental) settings is confounding, and crosstabulation is a simple method to mini-
mize confounding. We will later learn about simple and multiple linear regression mod-
elling, other statistical methods developed with a similar purpose.
10.1. Tests
At that stage of your analysis, you will be running independence tests, which by defini-
tion contain two groups (and usually two different variables) to compare. Choosing
which test to apply depends primarily on the type of your variables:
When both variables are categorical, the test will operate on a table with a few
rows and columns, called a crosstabulation or a contingency table.
When the dependent variable is continuous and the independent variable is di-
chotomous, different tests will operate by comparing differences in means or
differences in proportions.
These tests do not cover all possible situations and form only a preliminary step
to regression, which allows analysing two or more variables of any type. We al-
so leave correlation aside for Section 11.
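As a rough guide to the Stata commands involved in each situation (the variable names below are purely illustrative, not taken from the course datasets):

* two categorical variables: crosstabulation with a Chi-squared test
. tab region religion, chi2
* continuous dependent variable, dichotomous independent variable: t-test
. ttest income, by(female)
* binary dependent variable, dichotomous independent variable: test of proportions
. prtest employed, by(female)
* two continuous variables: correlation (covered in Section 11)
. pwcorr income age, sig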
Bivariate tests introduce some important caveats of statistical tests:
Association is not causation: finding that an association exists between two
variables does not imply that the variables are causally related. Both variables
can be weakly, moderately or strongly associated, but moving from associa-
tion to causation requires a theoretical and substantive understanding of the
variables that no statistical test can provide on its own. The same is true of cor-
relation, which we will restate in Section 11.1 before writing our linear regres-
sion models.
For example, an association between the level of income and the level of educa-
tion of individuals does not provide any information on the causal links that re-
late income to education: we need a theory of how, for instance, the income of
a household influences the educational attainment of its children, and how, in
turn, the level of education of these children will determine their income when
they start to work. The statistically significant association of education and in-
come itself contains, in itself, none of these explanatory, theoretical elements.
The same would be true of an association between music tastes and political
ideology: the direction of the causal link between those characteristics is not sit-
uated in the association itself, but in the theoretical understanding of their in-
terplay. Jumping from association to causation with no explanatory theory
would be premature. It is also sadly common.
A more complex example is religion and life expectancy. An association be-
tween these variables might seem to indicate a direct link at the individual lev-
el, where religion affects life expectancy positively or negatively; the same asso-
ciation, however, might also indicate a more indirect link at the collective level,
where religion correlates, for instance, with socioeconomic groups of individuals
who enjoy higher or lower life expectancy for reasons (like income) other than
religious beliefs. In this case, jumping from association to causation would have
erroneously advanced a micro-level interpretation to explain a macro-level phe-
nomenon. This mistake of making inferences about individuals from group-level
data is called the ‘ecological fallacy’ and is also relatively common.
Statistical significance is not substantive significance: finding a statistically sig-
nificant association between two variables does not imply that there exists a
substantively significant association between them. Some associations can be
statistically significant but devoid of any substantive significance, and converse-
ly.
For example, if a statistical test shows that countries where people drive on the
right-hand side have higher fertility rates, it seems safe to state that this statisti-
cally significant association corresponds to no substantive phenomenon occur-
ring in the real world. Conversely, the substantively significant association that
exists between former colonial occupation by European countries and the side
on which people drive might not yield a statistically significant association, for
several reasons such as small sample size, coding errors or unobserved changes
in traffic policy.
Similarly, statistical significance does not provide an order of magnitude for the
substantive significance of the association. The statistical strength of an associa-
tion relies on data and sample size, and does not indicate that the association is
theoretically more important. A study could find a weak association (p < .1) be-
tween dictatorship and civil war, and also find a strong association (p < .001)
between oil resources and civil war, and still conclude that dictatorship is a more
important explanatory factor of civil war than the presence of oil. The signifi-
cance levels of data analysis are frequently confused with theoretical results,
which is why it is crucial to remember that statistical significance only stands for
a conventional indication of significance, usually based on the p < .05 level. Any
further interpretation of significance will have to be theoretically, not statistical-
ly, driven.
Example 10a. Trust in the European Parliament
The following examples measure the average trust in the European Parliament. We use
the measures available for the populations of three peripheral European countries, Por-
tugal, Ireland and Greece, which became infamously known as the ‘PIG’ countries dur-
ing the current financial crisis (study: ESS, variable: trstep).
We start by subsetting the data to these countries:

. keep if inlist(cntry,"GR","PT","IE")
(44939 observations deleted)

Within these countries, we will look at average trust in the European Parliament of citi-
zens and non-citizens, using the same ctzcntr binary variable for citizenship that we
used in Example 9e. The ctzcntr variable needs to be binary if we are to run a t-test on
its two groups (citizens and foreigners).

The command to run a separate t-test for each country runs as follows:

. bysort cntry: ttest trstep, by(ctzcntr)

The results of the first t-test (a test that we explain further in Section 10.4) show that,
in Greece, citizens are less inclined to trust the European Parliament than non-citizens.
The test compared the average trust score of both groups, which was measured on an
ordinal scale of 0 (no trust) to 10 (complete trust), and found that the difference is
negative: the “Yes” group of citizens shows less trust in the European Parliament (4.3)
than the “No” group of foreigners (5.8):

-> cntry = GR

Two-sample t test with equal variances

    Group |    Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
      Yes |   1929    4.341109    .0570617     2.50617      4.2292    4.453018
       No |     72    5.888889    .3055022    2.592272    5.279735    6.498043
 combined |   2001    4.396802    .0564503    2.525167    4.286094    4.507509
     diff |           -1.54778    .3011897               -2.138458    -.957101
     diff = mean(Yes) - mean(No)                                 t = -5.1389
     Ho: diff = 0                                 degrees of freedom = 1999
     Ha: diff < 0            Ha: diff != 0               Ha: diff > 0
  Pr(T < t) = 0.0000    Pr(|T| > |t|) = 0.0000        Pr(T > t) = 1.0000

The latter group includes only a few observations (n = 72), and the standard error for
their average trust in the European Parliament is therefore higher (.30) than it is for
Greek citizens (.05). The standard deviations further indicate that the groups have a
relatively similar underlying distribution of trust scores. Still, the confidence intervals do
not overlap, and the t-test concludes that the 1.54-point difference in average trust
scores is statistically significant with only a very, very small risk of error, denoted by the
probability level of the alternative to the null hypothesis being close to, but not exactly
equal to, zero (the “0.0000” middle value).

The lateral probabilities confirm the direction of the relationship: the difference in
scores between average trust among citizens, mean(Yes), and average trust among for-
eigners, mean(No), is highly likely to be negative and highly unlikely to be positive, in-
dicating higher trust among the latter group. To observe this difference and reach that
conclusion, two conditions are met: average trust is effectively different in both groups,
and the sample size of each group is large enough to establish that the difference is sta-
tistically significant.

Depending on the actual difference and on the number of observations available in
each group, the t-test might have a harder time identifying any statistically significant
difference, as shown in the results for Ireland:
-> cntry = IE

Two-sample t test with equal variances

    Group |    Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
      Yes |   1486    4.635262    .0589952    2.274187     4.51954    4.750985
       No |    184    5.195652    .1675218    2.272377     4.86513    5.526175
 combined |   1670    4.697006    .0557944    2.280073    4.587572     4.80644
     diff |          -.5603897    .1777167                -.908961   -.2118185
     diff = mean(Yes) - mean(No)                                 t = -3.1533
     Ho: diff = 0                                 degrees of freedom = 1668
     Ha: diff < 0            Ha: diff != 0               Ha: diff > 0
  Pr(T < t) = 0.0008    Pr(|T| > |t|) = 0.0016        Pr(T > t) = 0.9992

In these results, the difference in trust is still negative, confirming that foreigners are
more trustful of the European Parliament than citizens in Ireland too; the gap in aver-
age trust is smaller than it is in Greece, but the confidence intervals still do not overlap,
and the higher number of observations for foreigners allows the test to work with a
lower standard error for that group. The other results of the t-test are quite similar, in-
cluding the still very, very low probability levels.

Portugal offers a different picture. The gap in average trust is smaller, and the number
of foreigners in the sample data is very low. As a consequence of both factors, the con-
fidence intervals overlap and no statistically significant difference comes out of the test:

-> cntry = PT

Two-sample t test with equal variances

    Group |    Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
      Yes |   1897    4.355825    .0554851    2.416629    4.247007    4.464643
       No |     47    4.659574    .3777784    2.589919    3.899146    5.420003
 combined |   1944    4.363169    .0549027    2.420704    4.255494    4.470843
     diff |          -.3037495    .3574689               -1.004813    .3973136
     diff = mean(Yes) - mean(No)                                 t = -0.8497
     Ho: diff = 0                                 degrees of freedom = 1942
     Ha: diff < 0            Ha: diff != 0               Ha: diff > 0
  Pr(T < t) = 0.1978    Pr(|T| > |t|) = 0.3956        Pr(T > t) = 0.8022

Example 10b. Female leaders and political regimes

Consider the following example, where we compare the average level of democracy (on
a 0–10 scale) between two groups of countries, governed respectively by male and fe-
male leaders (study: QOG; variables: p_democ and m_femlead).
The command and its results follow:

. ttest p_democ, by(m_femlead)

Two-sample t test with equal variances

     Group |    Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
   0. Male |    149    5.228188    .3218452    3.928621    4.592182    5.864193
  1. Femal |      8       8.125    .5153882    1.457738    6.906301    9.343699
  combined |    157    5.375796    .3106018    3.891829    4.762268    5.989324
      diff |          -2.896812     1.39774               -5.657889   -.1357348
      diff = mean(0. Male) - mean(1. Femal)                      t = -2.0725
      Ho: diff = 0                                 degrees of freedom = 155
      Ha: diff < 0            Ha: diff != 0               Ha: diff > 0
   Pr(T < t) = 0.0199    Pr(|T| > |t|) = 0.0399        Pr(T > t) = 0.9801

The results of the t-test seem to indicate that countries run by female leaders are signif-
icantly more democratic at the 95% confidence level (p < .05). What shall we conclude
from this test? In a nutshell, nothing:

The dataset includes only 8 countries with female leaders, which leads to high
standard errors, wide confidence intervals, and makes the statistical significance
of the test misleading. Maximising sample size would be a prerequisite to any
further tests using this variable.

Furthermore, there is no theoretical support for drawing any form of conclusion
from these tests: for instance, female leaders like German Chancellor Angela
Merkel or Brazilian President Dilma Rousseff have not made these countries
dramatically more or less democratic, and while democratic elections might have
made their rise to power possible, so many other factors than regime type are at
play in selecting female heads of state that the test in itself is devoid of analyti-
cal substance.

Similarly, the statistical significance of the test cannot be taken as proof that
autocracies are equally likely to be ruled by male or female leaders: even a cur-
sory account of 20th century history will show that autocracies are very unlikely
to be led by female rulers, who are almost systematically excluded from the so-
cial groups that provide autocrats, such as the higher levels of military hierarchies.
The absence of female autocrats in recent history (one has to go back to 18th
century Russia to find a genuine female autocrat) makes the test even more absurd.

Finally, the democracy and autocracy indexes, despite the fact that they origi-
nally come from the canonical Polity IV dataset, are criticisable (more precisely,
the intermediate levels of democracy and autocracy that they provide are highly
problematic). Additional flaws in the data, such as measurement error, are thus
likely to affect independence tests, which makes the jump from association to
causation a rather arbitrary one.
The few remarks above make an important point: if you cannot substantively justify
your test to match its statistical significance results, you will eventually fall short of say-
ing anything relevant about the independence of its groups. This situation, and many
others, fall under what is sometimes jokingly referred to as a “Type III” error, which
consists in giving the right (statistical) answer to the wrong (substantive) question. The
conventional “Type I” and “Type II” errors are addressed below.
10.2. Independence
Statistical tests provide proof by contradiction. Their logic consists in assuming that we
are wrong to assume any association between our variables, and that the two variables
are in fact unrelated, free of any association. This assumption is referred to as the null
hypothesis, noted “H0”. It would be absurd to think, for example, that voters or politi-
cal leaders who drink herbal tea are more likely to support racist ideologies: herbal tea
consumption and racism are (hopefully) independent from each other.
Consider a less absurd example: does religion have an effect on political support for
democracy? Hypothetically speaking, holding religious beliefs might affect political
views by increasing individual self-confidence and providing beliefs that either support
or reject democratic rule. The hypothesis might run in either direction: at
that stage, there is no support for a particular direction. Furthermore, holding religious
beliefs is not just an individual factor: when large groups of individuals hold religious
beliefs, other factors will come into play, such as group persecution or collective domi-
nance, which might also affect how each individual then views democracy. When
working with so many known unknowns, the only safe line of reasoning actually con-
sists in suspending all our former beliefs and staying agnostic. The null hypothesis thus
states that religion and democratic support have no relationship whatsoever. More sub-
stantially, what the null hypothesis means is that virtually 100% of democratic support
can be explained by factors other than religion, such as socioeconomic factors and oth-
er contextual elements like the ones cited above.
Example 10c. Religiosity and military spending
Is the average level of religiosity in the population associated with the percentage of gross
domestic product spent on military expenditure? Before jumping to any conclusion,
consider the following:
You might have good reasons to think that there is a positive association, if the
question reminds you of how particularly bellicose countries invoke religious
motives to justify, for example, some forms of ‘holy war’, but you already real-
ise, at that point, that ‘holy wars’ can explain only a tiny fraction of military
spending by states worldwide.
You might also have good reasons to think that most practices of religion actu-
ally preach non-aggressiveness, and would therefore drive states to bring mili-
tary expenditure down in the absence of popular support for it. At that stage, you
also realise that military expenditure is not necessarily subject to public opinion
pressures, and that even if it is, other factors are likely to come into play, with
possibly greater explanatory power.
Finally, you will probably conclude that military expenditure and religion should
not be assumed to be associated by default: the most reasonable approach, and
the statistically correct one, is to adopt an agnostic stance by stating the null
hypothesis, which basically means that “we cannot know from the start, and
that there might just be an association, but only by rejecting the absence of any
relationship can we establish that.”
Just for kicks, you can open the QOG data and try to find a statistically sig-
nificant association between the wvs_rel variable (an average measure of “how
important God is” to the population) and the wdi_megdp variable (national mil-
itary expenditure as a % of GDP). It will quickly appear that any categorisation
of either variable is going to distort the data to the point where finding an asso-
ciation will primarily rely on your own manipulations.
You should mentally start any bivariate test with a null hypothesis: even when you are
testing a plausible association, such as between education and racism, you should con-
sider the null hypothesis: these variables are independent from each other, no associa-
tion exists between education and racism, virtually 100% of racism can be explained by
factors other than education (and vice versa).
Your test then proceeds by trying to reject the null hypothesis. More precisely, it will
provide a probability for the null hypothesis to be verified. That probability is expressed
as the p-value, which varies between 0 and 1. In order to reject the null hypothesis, you
will read the p-value: a p-value close to 0 indicates that the likelihood for the null hy-
pothesis to be verified is weak, whereas a p-value close to 1 indicates that there is high
likelihood for the null hypothesis to be correct.
Consequently, a bivariate test that reveals a statistically significant association will come
with a low p-value. The level below which you can reject the null hypothesis is called
the (alpha) level of significance. By convention, α = 0.05 in most circumstances, for no
other reason than the practical convenience of that decision rule. If p < 0.05, assuming
an association between your two variables comes with less than a 5% risk of assuming
an association where there is actually none, a situation called a “Type I” error, where
you reject the null hypothesis even though it is actually true. The reverse situation,
where you retain the null hypothesis while it is actually false, is called a “Type II” error,
and is frequent in small samples on which statistical tests produce less reliable results.
Note that this explanation confuses significance testing with hypothesis testing, which is
theoretically inaccurate, but acceptable for our purpose here. If you find a statistical
textbook that correctly reports the difference between Fisher’s p and Neyman-
Pearson’s α, then you are reading quite advanced textbooks. The small confusion made
here is technically mistaken in statistical reasoning, but it should prove less problematic
than the other confusions addressed elsewhere in this guide.
Based on what has just been outlined, you should try to minimize “Type I” and “Type
II” errors in your tests. If you need to establish higher certainty about an association, as
is often the case in studies involving chemicals, because a “Type I” error might carry
dramatic consequences for the people exposed to them, you will use
α = 0.01 and reject the null hypothesis only if p < 0.01, or even a lower threshold such
as p < 0.001 or p < 0.0001. In parallel, if you fear missing an association by retaining
the null hypothesis when you should have rejected it, thereby making a “Type II” error,
then you should maximize sample size and reduce the number of missing observations,
in order to maximize the number of observations for each variable of interest.
Example 10d. Religion and interest in politics
The Chi-squared tests below test for a relationship between having an interest in poli-
tics and belonging to a religion (data: ESS, variables: polintr and rlgblg). The tests are
respectively applied to French and Russian respondents:
. keep if inlist(cntry,"FR","RU")
(46557 observations deleted)

. bysort cntry: tab polintr rlgblg, chi2

As running the tests and reading the results will show, the association of both variables
is statistically significant in France (p < .05), but not in Russia (p > .05). We should
hence reject the null hypothesis for French respondents, and consider that a relationship
exists in this country between the two factors. In the case of Russia, however, we can-
not reject the null hypothesis and must retain it. There is a small probability that we are
wrong in both cases: in the French case, rejecting the null hypothesis while it is actually
true would lead to a “Type I” error, and in the Russian case, retaining the null hypothe-
sis while it should have been rejected would lead to a “Type II” error.

Statistical tests such as the Chi-squared test operationalize the probabilistic logic de-
scribed above. Take, for example, the results of the above Chi-squared test for French
respondents, showing column percentages instead of frequencies:

. tab polintr rlgblg if cntry=="FR", col nofreq chi2

        How interested |  Belonging to particular
            in politics |  religion or denomination
                        |       Yes         No |     Total
        Very interested |     13.57      17.62 |     15.66
       Quite interested |     38.72      33.55 |     36.06
      Hardly interested |     35.33      33.65 |     34.46
  Not at all interested |     12.38      15.17 |     13.81
                  Total |    100.00     100.00 |    100.00

          Pearson chi2(3) =  12.5681   Pr = 0.006

The null hypothesis states that interest in politics is independent from religion. In that
case, there should be as many “very interested” respondents among French religious
believers and non-believers, but this is not the case: very high interest in politics
(15.66% of all observations) is over-represented among non-believers and under-
represented among believers.
Using the frequencies of each variable, the hypothetical frequencies of their crosstabu-
lation in the absence of any relationship between them can be calculated; these expected
values are then compared to the observed frequencies to calculate the Chi-squared
statistic. This statistic is then combined with the degrees of freedom of the crosstabulation,
which correspond to the number of rows minus one, multiplied by the number of
columns minus one.
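Although the guide does not show it, Stata can display these expected frequencies next to the observed ones, which makes the logic of the test easier to follow. For the French subsample, for instance:

. tab polintr rlgblg if cntry=="FR", expected chi2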
Your handbook details both computations, and will also give you a table which crosses
the Chi-squared statistic with its degrees of freedom to obtain the p-value of the asso-
ciation between both variables. The underlying logic of the Chi-squared test (which is
also called the ‘goodness of fit’ test) is essential to your understanding of how hypothe-
sis-testing works: make sure that you are familiar with it before moving on to further
techniques.
This short example yields a Chi-squared of 12.5 against 3 degrees of freedom, with a
corresponding p-value reported as “0.006”, which really means “below 0.006”. The small
p-value allows us to reject the null hypothesis with great confidence, as there is less than
a 0.6% risk that we are making a “Type I” error. We can therefore reject the null hypothesis
and retain our alternative hypothesis, which states that interest in politics and religion
are not independent but significantly associated in the French sample of respondents.
At that stage, however, we still lack a plausible theory to explain that association sub-
stantively. Even if we find an explanation, reading the whole crosstabulation will show
that the relationship seems more complicated than expected: religion does not increase
or decrease interest in politics uniformly across all groups. We will also want to make
sure that the effect of religion on interest in politics is not cancelled by, for example, the
average age of respondents in each group. The Chi-squared test leaves these questions
unanswered: more sophisticated tests will (start to) address them in Section 11.
10.3. Crosstabulations
Most bivariate tests combine two categorical variables, such as income groups, levels
of education, geographical regions or regime types. Crosstabulations of such variables
are especially frequent with survey data, where the answers given by the respondents
are coded as nominal variables, such as religion, or as ordinal variables on ‘short’ scales,
such as agreement scales that usually range over 3 to 12 items, from “Strongly agree”
to “Strongly disagree”.
These tests combine variables in a “r x c” (rows by columns) contingency table. The
intersection of each row with each column forms a cell that contains the number of ob-
servations (called a cell count) for that intersection. Tables can be made easier to read
with row and/or column percentages, but the type of test to use ultimately relies on cell
counts, as shown in the examples below.
Example 10e. Legal systems and judicial independence
An ever larger number of countries is running elections, but fraud or candidate intimi-
dation was reported in at least a fifth of the cases between 2001 and 2008
(study: QOG, variable: dpi_fraud). Among the countries affected by electoral fraud,
some are also former colonies of Western imperial powers:
. tab fcol dpi_fraud, exact

           |    Fraud or Candidate
    Former |       Intimidation
    colony |        Affection
           |         0          1 |     Total
         0 |        76         23 |        99
         1 |        53         11 |        64
     Total |       129         34 |       163

         Fisher's exact =                 0.431
 1-sided Fisher's exact =                 0.234

The results above show Fisher’s exact test, which is superior to the Chi-squared test on
‘2 x 2’ contingency tables like the one above. The test produces a single statistic that
can be read as a p-value for the likelihood that the association between the variables is
accidental. Here, the p-value (0.431) is far from meeting any reasonable level of signifi-
cance: the association between former colony status and electoral fraud may well be
coincidental. We can retain the null hypothesis and look for other explanatory factors of
electoral fraud.

The Chi-squared test is also inferior to Fisher’s exact test when some cells in the cross-
tabulation hold fewer than five observations. When that ‘5+’ convention is violated, Fish-
er’s exact test is recommended over the Chi-squared test.

There are many more tests available to test for association in categorical data. Cramér’s
V, for instance, is a measure that complements the Chi-squared test by providing an
indication of the strength of the association. Other nonparametric measures were also
designed for particular types of association:

When both variables are ordinal, Spearman’s rho is a rank correlation coeffi-
cient that better captures association than the tests mentioned above. An
equivalent measure, Kendall’s tau, uses a different computational logic (closer to the
Gamma test) to achieve similar results.

When both variables are interval/ratio (more simply: continuous), Pearson’s r
provides a correlation coefficient that we will explore in Section 11, where we
cover correlation and regression as stronger analytical tools that go beyond test-
ing for independence.
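For the ordinal case, Stata implements both rank-based measures through the spearman and ktau commands. A minimal sketch, with two hypothetical ordinal variables that are not part of the course datasets:

. spearman educ5 income5
. ktau educ5 income5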
10.4. Comparisons of means
A common approach to some quantitative indicators will involve measuring a continu-
ous variable in two discrete groups, such as males and females. When a difference ap-
pears between these groups, it is often measured as a difference in means. This setting
is common in experiments, such as when we measure the average literacy of a group of
children who were given free schoolbooks against the average literacy of another group
of children whose parents had to pay for the same schoolbooks.
The comparison of means works by running a t-test, which computes the mean of a
continuous variable over two groups defined by a categorical or binary (dummy) vari-
able. The test compares the means by estimating whether their difference is statistically
significant. This method also appears in other tests, especially when the two groups can
be paired, as in the case of control and treatment groups. When conditions are met for
running an analysis of variance (ANOVA), as is common in psychological and clinical
studies, comparing means of a continuous variable (such as blood pressure) across two
groups of patients (e.g. those who received some medication and others who received
a placebo) is a standard technique to establish the causal effect of a given treatment.
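Stata runs a one-way analysis of variance with the oneway command. As a sketch, not one of the course examples, one could compare mean BMI across the four racial-ethnic groups used later in this section:

. oneway bmi raceb, tabulate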
Note: if your continuous variable is in fact coding for a binary outcome, such as the re-
sult of a medical procedure that succeeds (1) or fails (0) to cure a patient, then the dis-
tributional assumption of normality on which the t-test relies will be violated. In that
case, you should use the prtest command, which follows the same syntax as the ttest
command, as in prtest cure, by(gender). In the social sciences, this is relevant to dummy
variables coding for dichotomous outcomes such as the fact of being divorced (1) or not (0).
Example 10g. Gender and left-right political positioning
Political parties rely on different electoral clienteles and sometimes assume that posi-
tioning on the left-right spectrum significantly differs for men and women. A simple
test thus consists in measuring the mean left-right positioning of men and women for
several countries. We will start by looking at aggregate scores of left-right positioning
at the country level (data: ESS, variables: lrscale and gndr):
. gr dot lrscale, over(gndr) asyvars over(cntry, sort(1) des) ///
> exclude0 ylab(1 "Left" 10 "Right") ytit("") scale(.85)

The graph, which was slightly scaled down with the scale option for cosmetic reasons,
ranks countries by the left-right score of males. It shows no consistent pattern for the
average scores of males and females at the macro level:
[Figure: dot plot of average left-right self-placement, from 1 (Left) to 10 (Right), by gender, with countries ranked by the male average; legend: Male, Female]

The t-test then indicates an interesting result, which deserves some attention. In the
results below, the t-test is indeed statistically significant, but substantively insignificant.
We will read through each part of the test to reach that conclusion:

. ttest lrscale, by(gndr)

Two-sample t test with equal variances

    Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
     Male |   20620    5.211397    .0158609    2.277571    5.180308    5.242485
   Female |   22682    5.119566    .0146951    2.213164    5.090763     5.14837
 combined |   43302    5.163295    .0107862    2.244507    5.142154    5.184436
     diff |            .0918305    .0215926                .0495087    .1341524
     diff = mean(Male) - mean(Female)                            t = 4.2529
     Ho: diff = 0                                 degrees of freedom = 43300
     Ha: diff < 0            Ha: diff != 0               Ha: diff > 0
  Pr(T < t) = 1.0000    Pr(|T| > |t|) = 0.0000        Pr(T > t) = 0.0000

The interpretation of the t-test goes as follows:

On average, men place themselves slightly more to the right (5.21) than women (5.12)
on the left-right scale. The difference in means is tiny (diff = .09, 95% CI = .05 to .13),
but because the test draws on a very large number of observations (N = 43,302), it is
nonetheless statistically significant: the p-value for Ha: diff != 0 is below any conven-
tional alpha level. Substantively, however, a gap of less than one tenth of a point on a
ten-point scale is negligible, which illustrates why statistical significance should never be
confused with substantive significance.
From the theoretical part of the course, you should remember at that point that
the p-value for the alternative hypothesis is derived from the t-statistic, and that
the probabilities for the two directional hypotheses, diff < 0 and diff > 0, will always sum to 1.
Example 10h. Obesity and racial-ethnic profiles
We continue to explore the Body Mass Index of U.S. respondents (study: NHIS, varia-
bles: sex, raceb and bmi, as previously calculated in Example 9a), by looking at the
breakdown of BMI by gender groups and four main racial-ethnic profiles. We start by
plotting the average BMI for four ethnic groups (variable raceb). Graphically, we want
three separate dot plots showing the average BMI for each racial-ethnic profile among
males and females separately and for both sexes:
. graph dot bmi, over(raceb) ///
> by(sex, rows(3) total note("")) ytit("Average Body Mass Index")

In this presentation, the three graphs are stacked into a single column of three rows by
using the rows option. The plot for both gender groups is further generated by the total
option. Through visual inspection of the variables of interest, we can establish whether
gender and race might account for some variation in the Body Mass Index of U.S.
respondents:

[Figure: dot plots of average Body Mass Index (0–30 axis) for White, Black, Hispanic and Asian respondents, shown separately for males, females, and the total sample]

The several comparisons suggested by this graph cannot be brought together into a
single bivariate test; instead, multiple linear regression will be used to describe the joint
effects of gender and race on BMI (Section 11). A single comparison can focus, however,
on the markedly lower BMI of Asian respondents in reference to all other racial-ethnic
profiles.
The t-test will not accept more than two values for the independent categorical varia-
ble, so we had to create a dichotomous (or binary) variable, coding “1” for “Asian”
and “0” for any other racial-ethnic profile. As we do so, we must be careful to prevent
missing data from being coded as “0”, as it would distort the data if some respondents
did not report their racial-ethnic profile.
We do not need to use recode to generate a full-fledged variable with proper labels at
that stage: a dummy that we will create on the fly is enough. A single line of code gen-
erates that dummy through the logical statement (raceb==4), which returns 0 when
false and 1 when true. Where the raceb variable indicates the value for “Asian”
(raceb==4), it will code 1 and 0 otherwise.
Using this statement, we create the dummy variable asian for all non-missing observa-
tions of raceb, using the additional logical statement if !mi(raceb) to that effect, and
then finally check our operation with the su command, for which the mean indicates
the percentage of Asians in the sample:
. gen asian=(raceb==4) if !mi(raceb)

. su asian

    Variable |     Obs        Mean    Std. Dev.       Min        Max
       asian |   24291    .0564407    .2307754          0          1

We then run a t-test for BMI between Asians and non-Asians:

. ttest bmi, by(asian)

Two-sample t test with equal variances

    Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
        0 |   22920    27.44618    .0340063     5.14833    27.37953    27.51284
        1 |    1371    24.32456    .1037125    3.840163    24.12111    24.52801
 combined |   24291       27.27     .032942    5.134197    27.20543    27.33457
     diff |            3.121622    .1413385                 2.84459    3.398654
     diff = mean(0) - mean(1)                                   t = 22.0861
     Ho: diff = 0                                 degrees of freedom = 24289
     Ha: diff < 0            Ha: diff != 0               Ha: diff > 0
  Pr(T < t) = 1.0000    Pr(|T| > |t|) = 0.0000        Pr(T > t) = 0.0000

The interpretation of the t-test goes as follows:

Out of N = 24,291 respondents for which we could calculate a Body Mass In-
dex, the average BMI approaches 27.4 for n = 22,920 non-Asian respondents
and 24.3 for n = 1,371 Asian respondents. The standard error is larger for
Asians, since we have fewer observations for that group.
The difference in means shows that the BMI of Asians is inferior by roughly 3
points to the BMI of non-Asians in the United States. Furthermore, the standard
deviation of BMI within the Asian group is smaller than for the rest of the sam-
ple, indicating that the distribution of BMI among Asian respondents around its
mean is more compact.
Pause at that stage and interpret substantively the difference. Body Mass Index
follows an international standard, where a BMI of 25 indicates overweight.
Therefore, the average respondent in the sample is overweight according to our
results. However, this does not apply to Asian respondents, who are slightly below,
but not that much below, that conventional threshold.
The null hypothesis (Ho) predicts that the difference in means between Asian
and non-Asian respondents is null (Ho: diff = 0). If the difference is non-null,
the null hypothesis further estimates the probability for that difference to be
due to sampling error (although it naturally cannot correct for measurement er-
ror).
The null hypothesis hence tests the following statement: “If the Body Mass In-
dex is strictly independent from race, then any difference in means between
Asians and non-Asians is accidental.” Rejecting the null hypothesis amounts to
rejecting that statistical statement, which concerns statistical significance; addi-
tional observations about the causes and reasons of that difference will require a
substantive theory, such as a difference in physiological and nutritional determi-
nants among Asian respondents.
The t-test actually shows a difference, noted diff = mean(0) - mean(1), that ap-
pears to be statistically robust, therefore contradicting the null hypothesis. In
this example, on average, the Body Mass Index of Asians (the group for which
the by variable, asian, takes the value 1) is lower than the BMI of non-Asians
(group 0) by approximately 3 points (diff = 3.12, 95% CI = 2.84 to 3.39).
The alternative hypothesis (Ha) predicts that there is a meaningful association
between race and Body Mass Index, which should cause the average BMI of
Asians to differ substantively from the average BMI of non-Asians. This hypoth-
esis implies that the difference in average BMI between both racial-ethnic pro-
files should be significantly different from zero (Ha: diff != 0).
The p-values for the t-test (Pr) indicate that we can reject the null hypothesis
Ho because the p-value for (Ha: diff !=0) is inferior to our level of significance,
α = 0.05. At that stage, we gain empirical confirmation of what we previously
observed graphically. More precisely, the test shows that subtracting the mean
BMI of Asians to the mean BMI of non-Asians is very likely to give a positive re-
sult: the probability level for Ha: diff > 0 is highly significant (p < .01).
Interpreting a t-test requires reading all the information used in this example,
but reporting a t-test is usually much quicker. Comparing means between
groups that fit with reasonable theoretical expectations generally just requires
reporting the existence of a significant difference. Other results are less im-
portant, given that we will obtain more precise estimates of the difference by
including racial-ethnic profiles with other variables into our regression model.
Example 10i. Political regime and female legislators
The t-test can quickly run into issues of statistical significance if it is run on a low num-
ber of observations. The following example tests the hypothesis according to which
federal regimes lead to higher representation of women in parliaments. The hypothesis
could be tested on only a small group of countries for which the data were available (data:
QOG, variables m_wominpar and pt_federal):
. ttest m_wominpar, by(pt_federal)

Two-sample t test with equal variances

     Group |    Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
  0. No fe |     68    16.06029    1.284192    10.58972    13.49704    18.62355
  1. Feder |     13    18.42308    2.879484    10.38213    12.14922    24.69693
  combined |     81    16.43951    1.169831    10.52848    14.11147    18.76754
      diff |          -2.362783    3.196071               -8.724404    3.998838
      diff = mean(0. No fe) - mean(1. Feder)                     t = -0.7393
      Ho: diff = 0                                 degrees of freedom = 79
      Ha: diff < 0            Ha: diff != 0               Ha: diff > 0
   Pr(T < t) = 0.2310    Pr(|T| > |t|) = 0.4619        Pr(T > t) = 0.7690

A first issue here has to do with the large standard errors caused by the low number of
observations, but we have no other choice of data. A second issue then has to do with
the large standard deviations, which indicate that the mean does not capture well the
distribution of women in parliament. Finally, an issue here could be that both variables
were measured at different points in time, but we expect regime type to be stable.

The issue with the standard deviations is a serious one, as shown by two kernel density
plots of the distribution of the m_wominpar variable for each regime type. The code for
the graph is pretty esoteric, as it involves combining two graphs with the cryptic tw
("twoway") command, using parentheses and the || separator between them:

. tw (kdensity m_wominpar if pt_federal) || ///
> (kdensity m_wominpar if !pt_federal), ytit("Density") ///
> xtit("Women in parliament (%)") ///
> legend(lab(1 "Federal regimes") lab(2 "Non-federal regimes"))
Contrary to our initial intuition, the resulting graph indicates that non-federal regimes
actually reach into higher values of women in parliament, whereas the rest of the
distribution is pretty similar for each regime type:

[Figure: kernel density plots of women in parliament (%), 0–40, for federal and non-federal regimes]

The interpretation of the t-test goes as follows:

In the N = 81 countries for which we have data for the year 2008, women oc-
cupy an average of 18% of parliamentary seats in federal regimes, and an aver-
age of 16% in non-federal regimes. The difference in means hence shows that
the percentage of women in parliaments is lower by roughly 2 percentage
points in non-federal regimes.

The null hypothesis (Ho) predicts that the observed difference is caused by
sampling error, and is therefore accidental (or coincidental) rather than signifi-
cant (Ho: diff = 0). The alternative hypothesis (Ha) states that the difference re-
flects a significant association, and that the difference in the average number of
women in parliament is not null (Ha: diff != 0) but rather corresponds to a sub-
stantive difference between both regimes.

The last lines of p-values (Pr) indicate that we cannot reject the null hypothesis,
as the p-value for Ha: diff != 0 is far above any reasonable level of sig-
nificance. We should conclude that, for the countries under examination, feder-
al rule has no significant effect on the representation of women in parlia-
ments.
10.5. Controls
An interesting use of bivariate tests resides in identifying control variables, which are
independent variables that might affect other independent variables, thereby
complicating the interpretation of the relationships that you might come to observe be-
tween your variables.
Example 10f. Party support (with controls)
The test featured in Example 10c has confirmed an association between party support
for the British National Party (BNP) and gender, with men being more likely to support
the BNP than women (study: BNP, variable: ). Other factors come into play: for in-
stance, a t-test would reveal that members of trade unions are less likely to support the
BNP than non-members.
If, in turn, women are more likely to be members of trade unions, it is difficult to know
if BNP support is influenced by gender or by trade union membership. Similarly, men
who are members of trade unions might be even less supportive of the BNP than fe-
males who are also trade union members, which would make the relationship more
complex.
Using the graph hbar command with two over options, we can plot BNP support over
both gender and trade union membership. In the resulting graph below, it becomes ap-
parent that gender still influences BNP support, even when controlling for trade union
membership:
[Figure: horizontal bar chart of average feelings towards the BNP (0–2.5 axis) by gender and trade union membership (Yes/No); bar labels: 1.27, 1.58, 1.71 and 1.98]
The code for the graph shows that we modified several things (we added bar labels,
and we also modified the title and scale of the axis):
graph hbar bnp, over(sex) over(union2) ylabel(0(.5)2.5) ytitle("Feelings towards the BNP") blabel(bar)
We could also ‘control’ for the effect of gender on party support by verifying that, if
men tend to be more supportive of the BNP, an extreme-right party, they also tend to
be less supportive of political parties on the opposite end of the right/left spectrum.
The graph below shows support for the BNP and for several other parties, still broken
down by gender. As it appears, men are indeed less politically supportive of parties situated
on the left; the difference in support becomes negligible on the right, and becomes vis-
ible again on the extreme-right.
Once again, the code for the graph comes with different modifications of the axis, bar
labels, and legend:
graph hbar bnp con green lab, over(sex) legend(label(1 "BNP")
label(2 "Conservatives") label(3 "Green Party") label(4 "Labour"))
blabel(bar) ylabel(0(1)6)

[Figure: horizontal bar chart of average feelings towards the BNP, Conservatives, Green Party and Labour (0–6 axis), by gender; BNP averages 1.61 among females against 1.90 among males, while the three other parties all average between 4.38 and 4.95]

As the language above suggests, the differences and interpretations that we have given
are entirely tentative, since they are based on graphical comparison. By running the
95% confidence intervals for the average support of men and women for each party
(displayed below), we can see that the differences are not actually robust to the stand-
ard error of the mean: the confidence intervals for male and female BNP support over-
lap, which means that the difference in support between males and females might be
attributable to sampling error.

. bysort sex: ci bnp con green lab

-> sex = male

    Variable |     Obs        Mean    Std. Err.    [95% Conf. Interval]
         bnp |     842    1.895487     .088887      1.72102    2.069953
         con |     852    4.948357    .0850586     4.781408    5.115306
       green |     792    4.376263    .0830203     4.213297    4.539229
         lab |     860     4.54186    .0893638     4.366464    4.717257

-> sex = female

    Variable |     Obs        Mean    Std. Err.    [95% Conf. Interval]
         bnp |     899    1.610679    .0721949     1.468988    1.752369
         con |     992    4.934476    .0796756     4.778124    5.090828
       green |     868    4.578341    .0765306     4.428134    4.728548
         lab |    1009    4.686819    .0847027     4.520605    4.853032

Bivariate tests, along with other procedures such as confidence intervals, should hence
lead you to think deeper about the possible associations between your variables. Once
you have sufficiently explored these relationships, you should move to a statistical pro-
cedure that will allow you to estimate a particular form of relationship between two varia-
bles while controlling for the effect of several other variables: linear regression.

One last word of caution before moving to regression models: remember that statisti-
cal significance is not substantive significance. More precisely, statistical significance is
neither necessary nor sufficient for substantive significance: some associations can be
statistically significant and yet devoid of substance, whereas others can be substantively
significant and yet too weak on independence tests. Type I and Type II errors remain possible
even after all possible tests, due to the fact that you are using probabilities with confi-
dence intervals and standard errors.

Consequently, your own capacity to interpret the data cannot be replaced by a statisti-
cal test. Instead, the tests should be used as heuristics: they should lead you to reflect
further on the quality and reliability of your data, and on your predictions regarding
your dependent and independent variables.
11. Regression
While correlation is enough to establish and qualify a relationship between two vari-
ables, linear regression adds some predictive value to your analysis. In statistical anal-
ysis, prediction consists in identifying an equation that can predict approximate values
of a dependent variable from the values of independent variable(s).
Linear regression is a form of statistical modelling that predicts a dependent variable
from a linear, additive relationship with one or more independent variables. In a simple linear
regression, there are two variables, one dependent (explained) and one independent
(explanatory). In multiple linear regression, there is still only one dependent variable
but two or more independent variables.
As its name indicates, linear regression captures only linear relationships. If your varia-
bles are related in any other way, as in exponential or curvilinear relationships (think of
the Kuznets curve in environmental economics), your regression will reflect it only very
poorly (if at all). Variable transformation as presented in Section 9.3 might or might not
solve that issue; other techniques that reach beyond linear regression modelling can
then be used to better model quadratic or polynomial relationships.
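As a purely illustrative sketch of that second option, a squared term can be added to the model so that the fitted line is allowed to curve. The variable names below are hypothetical, and the regress command is covered later in this section:

* add a quadratic term to capture a curvilinear relationship
. gen x2 = x^2
. reg y x x2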
A final word of caution: regression is not causation. While regression allows you to
identify predictive relationships between independent and dependent variables, it does
not allow you to identify a causal link between them. Only a substantive theory can
causally relate, for instance, income with life satisfaction, or gross domestic product
with low prevalence rates of HIV/AIDS. With linear regression, you will find coefficients
to relate both variables, but the causal link that might exist between them is a matter
of interpretation at that stage.
11.1. Theory
The steps followed by regression modelling are fairly similar to those that you fol-
lowed when you analysed the distribution of your variables:
Start by plotting your data to understand what you are working with. When
working on a single variable, you used histograms and frequency tables (Section
9). With two variables or more, you will be looking at scatterplots and scatter-
plot matrixes to look for linear relationships.
Continue by summarizing your data using numerical measures. When working
on a single variable, you used central tendency and dispersion. These measures
are still relevant at that stage. You will add correlation measures to identify pat-
terns between pairs of variables.
Finish by testing a model. When looking at a single variable, you tested the
normality of the variable (Section 9.3). When looking at several variables using
regression modelling, you will be testing the existence of a linear relationship
between them.
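For the first step, a scatterplot matrix can be drawn with the graph matrix command. As a sketch, using three continuous QOG variables that appear elsewhere in this guide:

. graph matrix ti_cpi wdi_aid wdi_prhe, half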
Note that you will not be able to read regression results properly without a clear under-
standing of what the standard deviation is. You will also need to read p-values as well
as F and t statistics. This guide does not cover the full detail of the theory behind re-
gression. Thus, you should turn to the corresponding handbook chapters before going
further with the analysis.
Similarly, prior to reading regression results per se, you will need to read some corre-
lation coefficients, which are quite straightforward: Pearson’s r is a number ranging
from -1 to +1, with proximity to either -1 or +1 indicating a linear relationship; the cor-
relation itself can be significant or not, and therefore also comes with a p-value.
Finally, some caveats related to those mentioned in Section 10.1 apply:
Correlation is not causation. Correlation is symmetric: it tests for the strength
of any relationship between two variables, and the relationship can theoretically
go both ways. It is only by substantive interpretation that you can make an
asymmetrical causal claim, as in “age causes religiosity”, because the causal ar-
row between these two variables can go only one way (i.e. the counterfactual is
impossible: religiosity cannot influence age, whereas it is possible for age to in-
fluence religiosity).
Beware especially of the ecological fallacy, which might wrongly lead you to
make inferences about the individuals of your sample while looking at group-
level data. For instance, the high life expectancy in France is a group-level char-
acteristic that does not apply at the individual level; otherwise, speaking French
while smoking and drinking would protect you from developing cardiovascular
disease or cancer. Similarly, a U.S. Republican candidate can come first in poor
U.S. states even if poor voters tend to vote for the Democrat candidate.
11.2. Assumptions
For the linear regression model to run properly, it should be applied to a continuous
dependent variable. If your dependent variable is categorical, you can still run your re-
gression as long as the variable has a scale: you can typically treat an ordinal variable as
numeric (continuous).
All kinds of variables can be used as your dependent variable: gross domestic product,
an electoral score, a scale of life satisfaction or level of education, even a binary out-
come, such as “democratic or not” (0/1), can all be submitted to regression analysis.
There are other assumptions and techniques that apply to regression but that we will
ignore, because the course is introductory and limited in scope:
We will not cover selection techniques (such as nested regression) that can be
used to pick the right independent variables for your regression to reach its
highest predictive value. Instead, we will stick with the independent variables
that you chose by intuition and prior knowledge.
We will not go in depth into categorical regression models that apply specifical-
ly to binary outcomes (logit and probit), nominal dependent variables (multino-
mial), and so forth. Regression with categorical data can use very sophisticated
models, taught in intermediate courses.
Finally, we will not run regression diagnostics [actually, we just might] which
consist in studying the residuals of your regression model in order to validate
the other assumptions to regression analysis.
The fact that we will not take these assumptions and techniques into account does not
invalidate your regressions straight away. Even if the predictive value and precision of
your final model could have been improved, your work will still yield interesting results.
Linear regression has many variants. Ordinary Least Squares (OLS) is only one method, as is its expanded version, Generalized Least Squares (GLS). Another version, two-stage least squares (2SLS) regression, is equally useful. [Explain these briefly. Handbooks rarely provide such an overview.] Another very useful technique, locally weighted least squares, can be used to detect nonlinear relationships in scatterplots by fitting a locally weighted scatterplot smoothing (LOWESS) curve.
Example 11a. Foreign aid and corruption
This example uses several graph options and combines a linear fit with a quadratic fit and a LOWESS curve, in order to show the many problems that a simple linear regression can obscure (study: QOG, variables: ti_cpi and wdi_aid with some light recoding). The correlation between foreign aid distributed as development assistance and an index of corruption is statistically significant (r = 0.265, p < 0.01), but a graphical look at the linear fit shows that it poorly represents the data. Furthermore, the quadratic fit and LOWESS
Finally, the scatterplot also reveals several outliers, like Singapore, Israel and Bangla-
desh.
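A graph of this kind can be reproduced with a single twoway command. The sketch below is a minimal approximation of the figure described above, not the exact course code: the variable names corruption, aid and ccodealp are placeholders for the recoded ti_cpi and wdi_aid variables and a country-code identifier.
* Scatterplot with linear, quadratic and LOWESS fits overlaid (variable names are placeholders).
tw (sc corruption aid, mlab(ccodealp) msize(small)) ///
   (lfit corruption aid) ///
   (qfit corruption aid) ///
   (lowess corruption aid, bwidth(1)), ///
   legend(order(2 "Linear fit" 3 "Quadratic fit" 4 "LOWESS"))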
This example illustrates the many difficulties of regression modelling: even at the level
of two variables, providing a faithful account of a relationship can require complex
transformations or fairly advanced techniques. It can take a very long time to produce a
satisfying model, and even longer to produce a meaningful interpretation based on its
statistical results.
Once you also begin to consider measurement issues, omitted variable biases and pos-
sible variations over time, the analysis reaches a level of complexity that requires full-
time specialization in quantitative methods.
11.3. Correlation
Correlation is the most straightforward way to test for independence between two con-
tinuous variables, by looking for a pattern in their covariance. Stata lets you build corre-
lation matrixes, from which you can read correlation coefficients for any number of
variables. Correlations can be partial or ‘semipartial’ and can be computed on different subsamples, and each method comes with strengths and weaknesses.
When running a correlation, use the pwcorr command to select all observations for
which the values of both variables are available. The pw prefix in the name of the
command stands for pairwise deletion of missing data: it means that this command
will calculate each correlation coefficient by using all the observations for which the pair
of variables used by the calculation are available.
[Figure for Example 11a. Scatterplot of the Corruption Perceptions Index (vertical axis, 2–10) against Net Development Assistance and Aid (horizontal axis, current million USD), with country points labelled by ISO code and three fits overlaid: linear fit, quadratic fit and LOWESS. Sources: Transparency International, World Bank. LOWESS curve bandwidth = 1.]
An important problem with pairwise deletion is that each correlation coefficient ends up being calculated on a different part of the sample held by the dataset, i.e. on a different
subsample. This creates serious issues of external validity: it not only limits the ability
to compare correlation coefficients obtained under that method, but it also threatens
the possibility to generalize them to the population represented by the sample.
When building a correlation matrix, it is generally more reasonable to deal with missing
observations by excluding all observations for which any of the variables are missing.
This method, called casewise, or listwise, deletion of missing data, is implemented by
the corr command. Still, depending on your data structure, this method might result in
excluding a very large fraction of the observations contained in your dataset when cal-
culating correlation coefficients, which again threatens the representativeness of your
sample.
The problems outlined above are critical when your data contains many missing obser-
vations, which might excessively distort the correlation coefficients and limit generaliza-
tion. There is no statistical solution to these issues because they emerge at the level of
data collection and might only be solved at that stage. An acceptable procedure con-
sists in adding the obs option to the pwcorr command, in order to get the number of
observations used in calculating each correlation coefficient. Any important variation in
these numbers should be interpreted as a threat to external validity, which you should
take into account while interpreting the correlation matrix.
Add the sig option to your pwcorr command to obtain significance levels in your corre-
lation matrix. For improved reading, use the star(.05) option to add a star next to statis-
tically significant correlations at p < .05. The strength at which you should start consid-
ering a correlation is a substantive question that depends on your research design, but a
value of 0.5 is usually a good start to identify strong correlations, and a value of 0.25
might identify moderate correlations.
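As a minimal sketch of the commands discussed above, with dv, iv1 and iv2 standing in for your own variable names:
* Casewise (listwise) deletion: every coefficient uses the same subsample.
corr dv iv1 iv2
* Pairwise deletion, with the number of observations and p-values shown,
* and correlations significant at the .05 level starred.
pwcorr dv iv1 iv2, obs sig star(.05)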
Due to their multiple issues of validity, you should refrain from drawing strong infer-
ences from correlation matrixes. Use correlation coefficients for explorative purposes,
to refine your intuitions and hypotheses. Once you have understood Pearson’s r and
the explorative potential of correlation matrixes, more robust results will be provided by
linear regression.
Example 11b. Trust in institutions
Trust is a common measurement in social surveys that usually applies to either people in
general or to specific social and political institutions like parliaments, politicians or the
police force. We decided to focus on institutional trust (study: ESS, variables: all those starting with trst, which can be selected with the wildcard trst*) to illustrate how trust correlates between institutions:
. pwcorr trst*, star(.05)

             trstprl  trstlgl  trstplc  trstplt  trstprt   trstep   trstun
   trstprl    1.0000
   trstlgl    0.6674*  1.0000
   trstplc    0.5527*  0.6813*  1.0000
   trstplt    0.7005*  0.5745*  0.5051*  1.0000
   trstprt    0.6657*  0.5499*  0.4667*  0.8691*  1.0000
    trstep    0.4776*  0.4240*  0.3576*  0.5307*  0.5418*  1.0000
    trstun    0.4441*  0.4354*  0.4199*  0.4888*  0.4961*  0.7175*  1.0000
The correlation matrix reproduced above shows moderate-to-strong correlations between many institutions. For instance, there is a strong association between the trust scores of two supranational institutions, the European Parliament and the United Nations (r = 0.72). That association is actually less intense between the European Parliament and national parliaments (r = 0.48). Similarly, there is a remarkably strong correlation between trust scores for politicians and for political parties (r = 0.87), but the associations between these two scores and the trust score for the legal system are less marked, despite the importance of the rule of law in guaranteeing democratic electoral competition.
All observations drawn from correlational analysis are tentative. The most robust find-
ings actually come from the absence of any significant correlation, which can designate
mutually exclusive situations, as in a measure of voting preference for several political
candidates: if constitutional rules are set to organise a uninominal (single-winner) ballot, then the election is a zero-sum game between the candidates, and voters are likely to polarise their
opinions and reject all candidates but one.
However, when the correlation matrix shows very strong associations between two or
more independent variables, then you can start diagnosing potential issues of multicol-
linearity in your future regression model. Multicollinearity is the situation where independent variables are strongly correlated with each other, in a way that is not captured by your model, where the focus is instead set on the dependent variable. Section 11.5 covers
multicollinearity in more detail.
11.4. Interpretation
Linear regression in Stata uses the regress command, followed by two variables for a
simple linear regression and any number of variables for a multiple linear regression.
The list of variables depends on your research design and on the results of your previ-
ous bivariate tests.
To interpret a linear regression model correctly, you should use robust standard errors (using the robust option) and then focus on the following results (a short command sketch follows this list):
The number of observations reflects the subsample used to perform the regres-
sion. This subsample is created by casewise deletion, as explained above. A low
number of observations will limit the validity of the model.
If you are facing a very low number of observations, you will need to remove
the variables that are causing that number to drop, to increase the validity of
the model for your sample population. If one of your independent variables is
available for less than 30 observations, remove it and run your regression with-
out it.
The p-value of the model (Prob > F) should be below your alpha level of signifi-
cance. The separate p-values for the coefficients in your model should obey the
same rule.
The p-values can be read independently: if a categorical variable returns both
high and low p-values on its dummies, interpret them separately. For instance, if
your model includes a variable that defines religious denomination, it might
happen that the variable produces a significant effect only for some religions
and not others.
The R-squared statistic indicates the predictive value of your model. It can be read as a percentage: an R-squared value of .08 indicates that your model explains only 8% of the variance of your dependent variable (whereas efficient models will usually explain over 80%).
An issue with the R-squared statistic is that it will mechanically increase with the
number of variables included in the regression model. To control for that effect,
you should read the adjusted R-squared (Adj R-squared) if your model includes
a large number of independent variables. This issue disappears with standard-
ised coefficients, as explained below.
Finally, the coefficients for both your variables and your constant (noted _cons
and also known as the intercept) are the parts of the model that you will inter-
pret. To make them comparable, you need to standardise them across variables,
using the beta option.
Technically, the coefficients establish the amount of variation in your dependent variable that occurs for a variation of one unit in each of your independent variables. This variation can be interpreted straightforwardly only for continuous data, and requires more thought for categorical data.
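A minimal sketch of the corresponding commands, with dv, iv1 and iv2 as hypothetical continuous variables and iv3 as a hypothetical categorical one (factor-variable notation requires Stata 11 or later; use the xi: prefix with older versions):
* Linear regression with robust standard errors.
reg dv iv1 iv2 i.iv3, robust
* The same model reported with standardised ('beta') coefficients.
reg dv iv1 iv2 i.iv3, beta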
[Example suggested by Dawn Teele] AJR's Colonial Origins of Comparative Development, p. 1378: "To get a sense of the magnitude of the effect of institutions on performance, let us compare two countries, Nigeria, which has approximately the 25th percentile of the institutional measure in this sample, 5.6, and Chile, which has approximately the 75th percentile of the institutions index, 7.8. The estimate in column (1), 0.52, indicates that there should be on average a 1.14-log-point difference between the log GDPs of the corresponding countries (or approximately a 3-fold difference: e^1.14 ≈ 3.1). In practice, this GDP gap is 253 log points (approximately 12.5-fold). Therefore, if the effect estimated in Table 2 were causal, it would imply a fairly large effect of institutions on performance, but still much less than the actual income gap between Nigeria and Chile."
Example 11c. Subjective happiness
The Quality of Government dataset includes a series of indicators that report self-
assessed happiness. One of these indicators consists in a mixed measure that combines life satisfaction and a subjective assessment of one’s life, from “best possible” to “worst possible”. Both measures use standardised psychometric scales that provide an indicator of individual happiness in the 0–1 range, multiplied by life expectancy at birth (study: QOG, variable: wdh_lsbw95_05, renamed life for convenience).
The steps that we took prior to running the linear regression model include data prepa-
ration, description and visualization. Each step is covered, with comments, in the
qog_reg.do file provided in Appendix A. We then thought of a list of potential predic-
tors for this dependent variable, which can be summarised along the following catego-
ries:
Security, measured as the absence of social, economic or political crisis
Wealth, measured as national affluence, free markets and low corruption
Freedom, measured as free speech, gender equality and democratic life
Health, measured as high life expectancy and low infant mortality
Education, measured as high educational attainment among both sexes
Looking at the variables in the Quality of Government dataset, we found many variables that could fit each part of the model, which is far from perfect; for example, the ‘Health’ component is partly redundant (or collinear) with the dependent variable, since the happiness indicator is already calculated against life expectancy. Similarly, infant mortality, life expectancy and educational attainment (measured as average schooling years) are heavily correlated, which would lead to measuring the same variable twice; and since corruption is an obstacle to free markets, it would likewise be measured twice if we included both as separate independent variables. Consequently, we removed some variables from the model after looking at a few correlations.
The final model tests three series of variables, corresponding to our three models of
happiness as Security, Wealth and Freedom. The table below summarises the respective
results of the models, and as shown, each of them carries statistically significant results
that can be substantively interpreted. Improvements of the model would include nor-
malizing some of the variables and applying some other diagnostics, quickly reviewed in
the next section.
Table 1. Estimated Effects of Security, Freedom and Wealth on Subjective Happiness

                              Model 1     Model 2     Model 3     Model 1+2   Model 1+2+3
                              Security    Wealth      Freedom
Failed state                  0.69***                             0.02        0.05
                              (0.03)                              (0.05)      (0.05)
Gross domestic product                    0.39***                 0.40**      0.39**
                                          (0.00)                  (0.40)      (0.00)
Market governance                         0.45***                 0.47***     0.46***
                                          (2.37)                  (2.38)      (2.26)
Freedom of speech                                     0.28**                  0.25**
                                                      (2.38)                  (1.91)
Freedom of the press                                  0.25                    0.39**
                                                      (0.11)                  (0.09)
Women’s social rights                                 0.31**                  0.00
                                                      (1.35)                  (1.10)
Electoral process                                     0.30                    0.52**
                                                      (0.91)                  (0.77)
Political process                                     0.39                    0.31
                                                      (0.81)                  (0.66)
Constant (or “Intercept”)     63.16***    35.93***    37.49**     35.23***    17.00*
                              (1.65)      (1.62)      (11.56)     (4.32)      (9.03)
Observations (or just “N”)    93          91          94          90          90
R-squared                     0.48        0.66        0.45        0.66        0.73

Standardised beta coefficients; robust standard errors in parentheses.
Significance levels: * significant at 10%, ** significant at 5%, *** significant at 1%.
[WRITE: Dummy variables] We might finally want to add a dummy variable for Western countries to see whether Western democracies really have an advantage over the rest of the world in terms of the subjective happiness of their citizens. The simplest way to do this consists in using the xi: reg command with i.ht_region, but the tab ht_region, gen(region_) command will also work.
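A hedged sketch of both approaches; life is the renamed dependent variable used above, iv1 and iv2 stand for whichever predictors you retained, and the numeric coding of ht_region should be checked against the QOG codebook before defining any dummy.
* Let Stata create one dummy per region on the fly (the xi: prefix is needed in Stata 10 or below).
xi: reg life iv1 iv2 i.ht_region, robust
* Or create the dummies by hand, as region_1, region_2, ...
tab ht_region, gen(region_)
* Hypothetical 'Western' dummy: category 5 is an assumption, check the codebook.
gen west = (ht_region == 5) if !missing(ht_region)
reg life iv1 iv2 west, robust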
[WRITE: Interactions]
11.5. Diagnostics
Linear regression models have the capacity to reveal many different aspects of your data, such as multicollinearity, which arises when several independent variables in the model all revolve around the same factor, and thus lead to estimating the same factor several times through different measurements. For instance, if you include monthly and annual income in your model, you are controlling twice for a single factor, measured in two different ways. Multicollinearity will affect your model by calculating separate coefficients for “independent” variables that are actually correlated, hence creating an issue in your linear model by including redundant information about your dependent variable.
Variance inflation factors (VIF), obtained with the vif command, allow you to assess multicollinearity. As a rule of thumb, the factors should stay below 10 (or, alternatively, their tolerance, 1/VIF, should stay above 0.1). If your VIF diagnostic finds multicollinearity, your independent variables include some collinear ones, which will affect the estimation of the regression coefficients.
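A minimal sketch, to be run immediately after fitting a model (dv, iv1 and iv2 are hypothetical variable names):
reg dv iv1 iv2
* Variance inflation factors; in recent Stata versions, estat vif (also available as the older vif command).
estat vif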
Another issue that might affect your regression model is heteroscedasticity, which designates a violation of one of the assumptions under which linear regression operates. Specifically, linear regression posits homoscedasticity, which stands for equal (or constant) variance of the residuals across the values of your independent variables. This is an important dimension of any regression model.
Under equal variance, a plot of your regression residuals should show no pattern among them. If a nonlinear pattern appears, or if the spread of the residuals around the fitted values is clearly not constant, then the data violate the assumptions of your model. The rvfplot command diagnoses that issue by producing a plot of residuals versus fitted values.
Finally, some formal tests exist to detect heteroscedasticity, such as the hettest and imtest commands. [WRITE]
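A minimal sketch of these diagnostics, run right after a regress command (variable names are again hypothetical):
reg dv iv1 iv2
* Plot residuals against fitted values; look for funnels or curves.
rvfplot, yline(0)
* Breusch-Pagan test for heteroscedasticity.
estat hettest
* White's test, obtained through the information matrix test.
estat imtest, white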
Each issue will have different consequences on your linear regression model. Multicollinearity will increase the standard errors of your coefficients and obscure the interpretation of the results. Heteroscedasticity indicates that the linear model is not reflecting the data correctly, and that the data might require transformation or a different model.
At that stage, you should also detect influential observations using a measure called
Cook’s D, and outliers, using tools akin to those described in Section 9.4. The studen-
tized residuals [WRITE]
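A short sketch of these last diagnostics, assuming that a model has just been fitted and that the dataset contains an identifier variable called country (both assumptions, not actual course names):
* Cook's distance and studentized residuals for the estimation sample.
predict cook if e(sample), cooksd
predict rstu if e(sample), rstudent
* A common rule of thumb flags observations with Cook's D above 4/N.
list country cook rstu if cook > 4/e(N) & !missing(cook)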
[EXAMPLE CONTINUING 11.4]
12. Cheat sheet
This section summarises the most useful commands used during the course and in this
guide to produce the kind of statistical analysis that is expected to appear in your final
research paper. Part 3 will further explain how to write your draft and final papers.
12.1. Theory
Colour codes: You need to understand these notions for Assignment No. 1; You need
to understand these notions for Assignment No. 2; You need to understand these no-
tions for the final paper. Each topic is featured at various places in the Stata Guide, and
the theoretical notions are explained in depth in your handbook.
Datasets: survey design, sampling strategy, sample size, units of observation,
variables, missing observations, categorical (ordinal, interval, nominal) and con-
tinuous (ratio, count) variables.
Normal distribution: standard error (of the mean), skewness and kurtosis, prob-
ability distribution, (standardized) z-scores and ‘alpha’ levels of precision, other
distributions (binomial, Poisson).
Univariate statistics: number of observations, mean, median, mode, range,
standard deviation, percentiles, quartiles, graphs (histograms, kernel density,
bar, dot and box plots).
Estimation and inference: point estimates, confidence intervals, t distribution,
null and alternative hypothesis testing, p-values, one-sided and two-sided/-
tailed tests, Type I (false positive) and Type II (false negative) errors.
Bivariate tests: t-tests (means comparison), proportions tests, Chi-squared tests
and Cramér’s V (independence or association), correlation (Pearson and Spear-
man), Gamma test.
Regression: simple and multiple linear regression, unstandardized and standard-
ized ‘beta’ coefficients, F-statistic, t- and p-values, R-squared, dummies, interac-
tions, diagnostics (residuals, multicollinearity, homoscedasticity, influence).
Statistical issues: sampling frames, survey weights, variable measurement and bias, normality assumptions, correlational vs. causal analysis, statistical vs. substantive significance.
Scientific writing: research design, paper structure, scientific style, tables and
graph formatting, referencing (sources and citations), discussion.
12.2. Data
Data management is covered in Sections 5–8. Transforming variables needs to be done
carefully, and takes a lot of time, especially when you are new to the data that you are
analysing. Some general advice applies:
Use the datasets that we recommend for this course. If you do not have a da-
taset ready well in advance for Assignment No. 1, fall back on ESS, GSS, WVS
and QOG data to find variables of interest.
Your dependent variable is a single, continuous measurement, such as the av-
erage Body Mass Index of American adults or the percentage of women in na-
tional parliaments around the world.
Your independent variables are possible predictors of your dependent variable.
Select a handful of ‘IVs’ of any type, such as age, gender, GDP per capita or
welfare regimes, that you think can explain the distribution of your ‘DV’.
Never save over a clean dataset. Do not save your changes to the data! This makes them impossible to retrace properly. Instead, post all the information you need to get replicated (rather than published) in the do-file.
Unabbreviated commands for these tasks:
drop and keep (to select observations and variables)
generate (to create new variables like sums or indices)
rename and label variable (to rename your variables with convenient names
that are short and understandable, and to assign them labels)
recode and replace (to recode data into simpler categories, such as dichotomous
variables with binary outcomes)
encode and destring (to convert string data into numeric format)
label define (to create value labels), followed by label values (to assign label to
the values taken by your variables)
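A short, hypothetical sequence combining several of the commands above (variable names and cut-off points are placeholders, not actual course variables):
* Rename and label a variable, then recode it into three labelled categories.
rename wdi_gdpc gdpc
label variable gdpc "GDP per capita"
recode gdpc (min/1000 = 1 "Low") (1000/10000 = 2 "Medium") (10000/max = 3 "High"), gen(gdpc3)
label variable gdpc3 "GDP per capita (3 categories)"
* Define a set of value labels and assign it to a hypothetical 0/1 variable.
label define yesno 0 "No" 1 "Yes"
label values treaty yesno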
Remember: poorly prepared datasets are much more complex to analyse, and even
harder to understand for others. As a rule of thumb, try to apply these principles:
Renaming your variables to humanly understandable words or acronyms is very
helpful, as long as you keep the new names short and to the point;
Proper labels should be assigned to both variables and values, so as to make
sense of categories, cut-off points and so on.
Survey weights should be documented, even if we do not use them extensively. Your do-file should include the svyset command documented in the ‘readme’ text file distributed with each course dataset.
12.3. Distributions
Start by describing your variables as shown in Section 9. Univariate descriptions of your
data will appear in your table of summary statistics. Some commands follow:
tab, tab1 or fre (to tabulate categorical variables; the additional command fre is
recommended because of its better handling of missing observations)
summarize (to produce a five-number summary for continuous data: number of
observations, mean, standard deviation, min and max)
summarize with the detail option (to further summarize continuous data with
quartiles and percentiles as well as skewness and kurtosis)
tabstat with the n mean sd min p25 p50 p75 max options (to further summa-
rize continuous data with quartiles); alternative command: univar.
Continue by graphing the distribution of your variables when you are able to comment
meaningfully on the distribution in your paper:
graph hbox (to describe the distribution of continuous data, showing quartiles
and outliers)
histogram with the kdensity and normal options (to describe the distribution of
continuous data; use other units if relevant)
histogram with the discrete option (to describe the distribution of categorical
data that support an interval or ordinal scale)
catplot (to describe the distribution of categorical data; this additional command
is rarely more useful than a frequency table obtained with tab1 or fre)
Remember: meaningful labels (usually percentages) will make your graphs much more
informative, so use xtitle and ytitle to assign titles to axes, ylabel to modify the units
and labelling of the y-axis, and specify frequency or percent to use these units of meas-
urement instead of density in histograms.
Finally, since independence tests are generally based on the presumption that your var-
iables are normally distributed, you should test this distributional assumption by looking
for possible transformations of your variable, using diagnostic plots:
symplot (to assess symmetry)
qnorm and pnorm (to assess normality at the tails and at the centre)
ladder, gladder and qladder (to compare transformations)
The summarize command with the detail option provides skewness and kurtosis: zero
skewness denotes a symmetrical distribution, and normal kurtosis is close to 3. A varia-
ble might approach normality when transformed to its logarithm (log or ln), or some-
times its square root (sqrt) or square (^2). Only a full, visual check using the commands
above can determine the relevance of a transformation.
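A minimal sequence illustrating that check, with dv as a hypothetical continuous variable:
* Detailed summary: percentiles, skewness and kurtosis.
su dv, detail
* Distribution with kernel density and normal overlays, in percentages.
hist dv, percent kdensity normal
* Compare ladder-of-powers transformations, then create a log version if appropriate.
gladder dv
gen log_dv = ln(dv)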
12.4. Association
Your paper is built around testing the null hypothesis of no association between your dependent and independent variables. There are many tests available to do so, but you will use a short list of independence tests:
ttest with the by option (to compare two means, using continuous data)
prtest with the by option (to compare two proportions, using discrete data)
tab with the chi2 option (to crosstabulate categorical data)
tab with the exact option (to crosstabulate on low cell counts or ‘2 x 2’ tables)
pwcorr with the obs, sig and star options (to build a matrix of significant corre-
lations at a certain level of significance)
The tests do not use the same method: comparison tests use confidence intervals, as
also provided by the ci or prop commands.
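A minimal sketch of the tests above, using hypothetical variables (dv continuous, smoker and female binary, educ3 and vote categorical):
* Compare mean dv across the two groups defined by female.
ttest dv, by(female)
* Compare the proportion of smokers across the same two groups.
prtest smoker, by(female)
* Crosstabulate two categorical variables with column percentages and a Chi-squared test.
tab educ3 vote, col chi2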
You should represent significant relationships graphically:
sc with two continuous variables; use the mlab(country) option to label data
points with the values of the country variable, as you might want to do with
country-level data or other data with identifiable observations.
spineplot for two categorical variables; make sure that you install this additional
command and train yourself to read it correctly, as it is arguably the most useful
way to plot two categorical variables against each other.
gr dot or gr hbar for a continuous dependent variable and a categorical inde-
pendent variable; do not publish these graphs unless you have solid grounds to
think that a categorical plot can convey more information than a table.
12.5. Regression
You can use linear regression as long as your dependent variable is continuous. If your
dependent variable represents a binary outcome, another model applies, using the logit
command. The most useful commands for linear regression are:
sc with two continuous variables (to visualize a possible linear correlation)
tw (lfitci dv iv) (sc dv iv) with two continuous variables (to visualize correlations
with their linear fit and confidence intervals)
reg dv iv1 iv2, with 2+ continuous variables (to run simple or multiple linear
regression models)
reg dv i.dummy (to add dummy categorical independent variables; add xi: in
front of the command if running Stata 10 or below)
char dummy[omit] value (to set the baseline, or reference group, for dummy
categorical variables; a recode is usually simpler/better, though)
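A minimal sketch of these commands, with dv and iv as hypothetical continuous variables and sex as a hypothetical categorical predictor:
* Scatterplot, then linear fit with its confidence interval.
sc dv iv
tw (lfitci dv iv) (sc dv iv)
* Simple, then multiple linear regression with a categorical predictor.
reg dv iv
reg dv iv i.sex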
In your regression output, you should concentrate on reading:
the number of observations (the data on which the model ran)
the t-values and p-values (whether the model is statistically significant)
the R-squared (a gross measure of the predictive value of the model)
the coefficients (the amount of variance predicted by one unit of the predictor)
the standard errors, t-values, p-values and confidence intervals (the reliability of the coefficients produced by the model)
12.6. Programming
Remember some very basic Stata operating procedures:
Setting the working directory is not an option. For this course, always check
that you have selected the SRQM folder as your working directory.
Remember to install additional packages like fre, spineplot or tabout, or you
will run into errors when calling these commands.
The commands of a do-file should be run in sequential order: if you try to exe-
cute line #30 before line #10, you are likely to encounter an error.
Get used to running multiple commands at once: select them and use Ctrl-D on Windows or Cmd-Shift-D on Mac OS X to run them all at once.
The following tricks are for readers who have little or no experience with programming
languages. Skip them if you are already acquainted with programming environments.
To execute a command and ignore breaks in case of an error, use the capture
prefix command (shorthand cap), as in cap drop age (which will drop the age
variable and just do nothing if the variable does not exist).
This option will come in handy with lines such as cap log close, which will re-
turn an error if no log file is open. Using cap is then a safety net to ensure that
the do-file will run regardless of a log file being open.
Run all adjacent commands connected with ‘///’ together. To break long commands onto several lines, use /// or set a line delimiter with the #delimit (or just #d) command; for instance, type #d ; to write in pseudo-C++ syntax.
The first option is very useful when you are dealing with code that extends be-
yond the limits of your do-file editor window. This will happen often if you are
coding graphs with many options.
To apply a command to several variables with a wildcard operator, use the *
symbol, as in su trst* if you want to summarize all variables starting with the
trst prefix, or destring party*vote to convert all variables named party1vote,
party2vote, …
This trick comes in handy if you have created binary variables from a categorical one. For instance, after the tab relig, gen(relig_) command creates dummies of religious beliefs, the relig_1, relig_2, … relig_n variables can be designated together by typing, for example, ci relig_*.
To use loops, read the foreach and while documentation. Stata code is Turing complete, so it can handle whatever you need. Use loops if you know how they generally function in programming environments; ignore them otherwise.
The simplest use of a foreach loop is when you are recoding a bunch of varia-
bles that all follow the same coding rules. In that case, you can place the recod-
ing operations in a foreach loop to save time and code.
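A minimal sketch of such a loop, assuming a set of variables starting with trst that all use the same missing-value codes (the codes 77, 88 and 99 are an assumption, not taken from the course data):
* Recode the same missing-value codes to missing for every trst* variable.
foreach v of varlist trst* {
    recode `v' (77 88 99 = .)
}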
To use macros, read the local and global macros help pages. Use macros only
if you have sufficient programming experience in another language where you
already learned to use macros, constants or scalars.
A basic example of a global macro is one that stores recurrent graph options,
which you can then call in only one word. A basic example of a local macro in-
volves counters, which should not come up in this course.
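A minimal sketch of a global macro storing recurrent graph options (the option list itself is only an illustration, and dv and iv1 are placeholders):
* Store a set of graph options once, then reuse them in several commands.
global gopts "percent scheme(s1mono)"
hist dv, $gopts
hist iv1, $gopts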
Examples of each trick will show up in portions of the course do-files that are not
shown in class, but included for you to explore while replicating the session. None of
these tricks are part of the course requirements.
*
My tentative conclusion to the course follows.
For empirical social scientists, statistical reasoning and quantitative methods provide an
additional layer of theory and methods to their sociological skill set, which also includes
a fair share of history, philosophy and various forms of intuition.
This layer is meant to enhance their capacity to make causal claims about observable
relationships in the social, material world. Just like any compound of theory and meth-
ods, the ‘SRQM’ layer is a thick one that is not easily digested: in fact, you will have to
ruminate it a lot, in the same way that you should be feeding on data and secondary
analysis to ground your own work.
You can be an intellectual fox or hedgehog, as Isaiah Berlin once put it, but if you
want to be an empiricist, you will have to be a ruminant too. Please do not be offended
when I say that this guide is actually a cookbook for hungry sociological herbivores.
And remember: your input to improve this guide is needed!
Part 3
Projects
The Assignments for this course will lead you towards submitting your final paper, in which you will present your research project. During the weeks of class, you will write
up two Assignments and one final paper, following instructions that are carefully de-
scribed in this section. Read these in detail.
The grades assigned to your Assignments should be read as a measure of your progress
towards producing your final paper. The grading ranges from 1 (critical revisions need-
ed) to 5 (cosmetic revisions needed).
The outline below shows how the Assignments progressively lead to your final paper.
Assignment No. 1:
1. Research question: identify a research topic, a dataset, and select some variables.
2. Univariate tests: identify the values and distributions of the variables used in the research.
Assignment No. 2 (revise and update Assignment No. 1 before moving to steps 3 and 4):
3. Bivariate tests: crosstabulate the variables and run significance tests for the crosstabs.
4. Regressions: run regression models, report their output, and discuss the results.
13. Formatting
Section almost entirely consistent with the course in its current form. Review. Nota-
bly, Tables and Graphs could be presented in another division, “Exports” and
“Presentation”.
Applying the formatting conventions below will ensure that your paper uses a presentational standard that comes close to academic practice. Do not consider your paper an academic one without some attention to presentation, which is necessary to help the reader understand your work.
Intelligent formatting will require exporting, styling and commenting on tables and graphs. This will make up for a substantial fraction of your word limit, and of your paper overall. If followed, these recommendations will help you build your paper methodically, which also helps with reasoning. QED.
13.1. Communication
When sending your Assignments, you must carefully respect some conventions that
apply to your email and attached files. Your Assignment emails should be structured as
in the following example:
From: use your Sciences Po or Gmail email address.
To: always email to both course instructors.
Subject: SRQM: Assignment No. 1, Briatte and Petev
Body: insert your comments and questions about your work.
Attachments:
Assignment: BriattePetev_1.pdf
Do-file: BriattePetev_1.do
Dataset (optional): attach your data in DTA format if you chose to work with a dataset outside of the recommended ones.
13.2. Files
Your Assignment is a text file written following a set of scientific conventions. Follow
the instructions provided in class to format your document, and print it as a PDF file
using any common utility to do so.
Your do-file must be executable: try running it to make sure that it does not produce errors. Use the template provided in class to do so, and do not hesitate to imitate the do-files from the course sessions.
Your dataset must be provided for replication purposes: if it does not feature in the datasets recommended for the course, send it along as a Stata data file with a .dta extension, converting it to that format if needed (Section 5), and making sure that it is ready for cross-sectional analysis (Section 7).
Important: if your dataset is larger than ~5 MB, compress it as a ZIP or RAR file, without any form of password encryption. If your data is still larger than ~10 MB after subsetting and compression, use sendspace.com to email it over. Please do not use any other format or service, to avoid unnecessary confusion.
13.3. Text
Your Assignment is not
a string of Stata output pasted into a text file. The last sections
of this guide (and Section 16 in particular) set out instructions to format your final pa-
per as a scientific one, but more generally, remember that academic writing involves
using precise, unambiguous terms in correct, simple sentences. Beyond issues of vocab-
ulary, grammar and syntax, you should also use a text structure made of a small num-
ber of balanced paragraphs and sections.
Data and procedures that come with no explanation are useless to the reader. In your
work, tabular and graphical data visualizations (tables and figures) should be supported
by substantive text, including a title, a legend and some explanatory notes, either in
your main text or as captions. More on this below.
In many ways, the same recommendations apply to the code in your do-file. As men-
tioned in Section 2, computer languages work a lot like human languages. For instance,
linguistic diversity also applies to programming languages: Stata code is only one of thousands of different ones, each of which possesses its own syntax rules, and they end up forming ‘families’ of languages, with shared properties and a more or less dynamic community of more or less proficient users, etc.
Code is fundamentally text, and computer code obeys common linguistic rules. For
example, you have noticed that Stata code supports abbreviations, with some com-
mands coming in two forms (a full form, such as summarize, and a shorthand form, such as su). Pushing that observation further implies that there is such a thing as writing ‘clear’ code, just like there is ‘clear’ writing.
This notion is often called ‘literate programming’, as computer scientist Donald Knuth termed it in 1984. Knuth proposed to “change our traditional attitude to the construction
of programs: Instead of imagining that our main task is to instruct a computer what to
do, let us concentrate rather on explaining to human beings what we want a computer
to do.”
Quoting again from Knuth, that mind-set carries one important lesson: “The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style.” Breaking it into simple rules that fit this course, the three most important ideas of literate programming that apply to your work, and that will convey sense to your code, go as such:
Syntax and vocabulary are (almost) inflexible. Computer code is less versatile
than human language, and deviations in the syntax or terms that you use will
generate errors. Eliminating these errors is the first aspect of ‘debugging’ a program, that is, correcting its content to make it behave properly. Your code
should be flawless by that standard.
Section 2.6 explains how to deal with a command that returns an error caused
by faulty syntax or by a spelling mistake. More complex aspects of debugging
involve, for example, the precise order in which you execute your commands.
Hopefully, the linear structure of Stata code should reduce issues in that catego-
ry to their minimum.
Complexity calls for annotations. As soon as we express difficult thoughts, we
add a ‘meta’ layer of information to our text, as with side notes in religious texts
like the Talmud, stage directions in theatre plays, or bracketed information and
footnotes in other kinds of texts.
In computer code, comments serve precisely the same elucidative purpose. They
will be useful to external readers and will also help you to remember why you
entered each command. There is no need to add comments on every single command that you use, and you should be able to distinguish which parts of your code need them.
Sectioning is not an option. Cartesian thought has this in common with playwriting: it uses blocks, like Acts and Scenes. Think of philosopher Ludwig
Wittgenstein’s Tractatus Logico-Philosophicus, which uses hierarchical number-
ing to outline seven propositions, or just think about the structure of any book
or newspaper, with page numbers, paragraph spacing, text columns, sub-heads
and running titles.
Your code should also feature a simple sectioning, with comments and blank
lines used to create blocks of code where commands are grouped in relevant
conceptual blocks, such as setup, recoding, descriptive statistics, analysis with a
first independent variable, and so on.
13.4. Tables
Quantitative evidence requires producing tables of summary statistics for many opera-
tions such as variable description, correlation or regression modelling. Stata output can
be exported by copying it from the Results window, using the ‘Copy as Picture’ func-
tion. This method might be acceptable for presenting draft texts, but not for final writ-
ten communication.
There are two main reasons for this, beyond aesthetics:
First, Stata output usually contains more information than required for a stand-
ard paper. Scientific communication is parsimonious and only requires a selec-
tion of summary statistics, while Stata is more exhaustive and sends larger
quantities of information for analytical purposes, to help you build your inter-
pretation.
Second, Stata output reports a high level of precision by including many digits
after the decimal point, which is inconsistent with the limited precision of initial
measurements. For example, the height variable of the NHIS dataset is reported
by the su command as having a mean of 66.68131 inches, a precision level that
does not reflect real measures.
It is therefore expected that you do not copy and paste from your Stata output to re-
port your results, but instead convert it to tables that follow some standard formatting
conventions:
Round all results to one or two decimals: After exporting your results to a
spreadsheet editor like Google Documents, Microsoft Excel or Open Office, use
a rounding function to truncate numbers.
Ideally, format your table as to align columns to decimal tabs. Centring your
tables on the decimal separator “.” helps with reading your results. Unfortu-
nately, not all word processors manage to do this well.
Ideally, provide notes with your table. Your table can use footnotes to com-
ment on its contents. These comments can either appear in your text, or better,
as footnotes immediately below the table.
Because word processors are variably competent or explicit about the latter two conventions, they are only optional here. However, for the most dedicated, brief instruc-
tions appear in this guide: http://people.oregonstate.edu/~acock/tables/center.pdf.
Start by exporting your tables using the method described below, which deals first with
continuous variables, and then with categorical ones:
First, install the tabout and tabstatout commands, then run the commands below, replacing dv, iv1, iv2 and iv3 with your dependent and independent variables, as long as they are continuous:
* Produce a standard summary statistics table.
tabstat dv iv1 iv2 iv3, s(n mean sd min max) c(s)
* Export to CSV file.
tabstatout dv iv1 iv2 iv3, tf(stats1.csv) s(n mean sd min max) ///
	c(s) f(%9.2fc) replace
The CSV file, which you will need to import into a spreadsheet editor like Microsoft Excel, contains a table of summary statistics that can easily be copied and pasted into your text processor.
Second, run this command on your categorical variables. The command will ex-
port a frequency table in percentages:
* Export to CSV file.
tabout iv4 iv5 iv6 using stats2.csv, replace
The files stats1.csv and stats2.csv should have been created in your working directory
(Section 3.3), and can be imported or copied and pasted into a word processor. Both
files are just working files and need not be sent with your do-file.
A useful way to compact both files into one, in order to produce only one table, is to
stick the frequency percentages of your categorical variables into the same column as
the mean values of your continuous ones. An example of that arrangement is shown
below in Table 3.
The following example describes a dataset and its main variables of interest for its
units of observation, N = 10 African countries, based on a recent journal article: Kim Yi
Dionne, “The Role of Executive Time Horizons in State Response to AIDS in Africa”, Comparative Political Studies 44(1): 55–77, 2011.
Table 1. Countries of analysis

Country         HIV prevalence (2001)    GDP (2001)
Ethiopia         4.1                       724.80
Mozambique      12.1                      1018.82
Rwanda           5.1                      1182.49
Zambia          16.7                       816.83
Tanzania         9.0                       547.27
Burundi          6.2                       618.10
Uganda           5.1                      1336.33
Lesotho         29.6                      2320.70
Namibia         21.3                      6274.30
Kenya            8.0                      1016.18

Note: inspired from Yi Dionne (2011).
The next examples describe the summary statistics for the Quality of Government da-
taset. By convention, “N” designates the total number of observations, and “SD” is the
acronym for the standard deviation.
Table 2. Summary statistics

Variable                                       N     mean      SD     min   median       max
Infant mortality rate (per 1000 live births)   181   43.30    39.97   2.80   27         166
GDP per capita (logged) a                      192   84.67   715.84   0.01    1.93     6243.05
Government health expenditure (% GDP)           76    7.57     1.56   4.58    7.44       11.20

Note: data from Quality of Government (2010).
a Variable transformed to natural logarithmic scale.
If you need to include categorical variables, use the “Mean” column to indicate the val-
id percentages for each category, as follows:
Table 3. Summary statistics

Variable                                       N     mean      SD     min   median       max
Infant mortality rate (per 1000 live births)   181   43.30    39.97   2.80   27         166
Government health expenditure (% GDP)           76   84.67   715.84   0.01    1.93     6243.05
GDP per capita (logged) a                      192    7.57     1.56   4.58    7.44       11.20
Regime type
  Monarchy b                                    13    6.99
  Military                                      12    6.45
  One-party                                      7    3.76
  Multi-party                                   56   30.11
  Democracy                                     89   47.85

Note: data from Quality of Government (2010).
a Variable previously transformed to natural logarithmic scale on grounds of normality.
b Includes only nondemocratic monarchies; cf. QOG Codebook (2011), p. 34.
The next example describes crosstabular output. By convention, independent variables
are displayed in rows, with column percentages. The example below provides both fre-
quencies and percentages, but it is common not to indicate the frequencies when these
reach high counts.
As you will notice, the table is not very helpful, as it uses the recoded versions of two
continuous variables. When the data are available as continuous variables, scatterplots
are always preferable to crosstabulations, especially when displayed with the linear fit of
a simple linear regression.
Table 4. Crosstabulation of GDP and HIV

                                HIV prevalence
                    Low           Medium          High           Total
GDP per capita      n      %      n      %        n      %       n      %
Low                 1    25.0     3    37.5       0     0.0      4    26.6
Medium              3    75.0     3    37.5       1    33.3      7    46.6
High                0     0.0     2    25.0       2    66.6      4    26.6
Total               4   100.0     8   100.0       3   100.0     15   100.0

Note: adapted from Yi Dionne (2011), replication dataset.
The final example describes multiple linear regression output. The important variables
to include are coefficients, standard errors, the constant (or intercept), the total number
of observations or N and the R-squared, as well as the starred p-values. The table
below reports multiple linear regression for three dependent variables and uses three
levels of significance (0.1, 0.05 and 0.01).
Table 4. Estimated Effects of HIV rates and GDP on API policy scores

                              API Policy    Health Spending    AIDS Spending
Log HIV prevalence            7.89          5.77               1.67
                              (10.03)       (4.25)             (1.28)
Log GDP per capita            -7.59         -2.51              1.97*
                              (8.04)        (3.41)             (0.97)
Constant (or “Intercept”)     96.97***      12.22              -21.48***
                              (21.75)       (9.22)             (2.7)
Observations (or just “N”)    15            15                 14
R-squared (or fancily “R2”)   0.08          0.13               0.52

Note: adapted from Yi Dionne (2011), Table 3. Each column corresponds to a different model, and starts with the name of the dependent variable. Standard errors in parentheses.
* p < .1. ** p < .05. *** p < .01.
As indicated above, it is good practice to add a table summary after your regression
output. The summary must report whether the results confirm or contradict the predic-
tions of the models (your hypotheses), as shown by the full table summary written by
the author:
Table summary: Contrary to what I hypothesized, longer time horizons are as-
sociated with lower values on the API policy and planning score, meaning less
AIDS intervention (API Policy). As predicted, longer time horizons are associated
with higher government expenditures on health (Health Spending). However,
no inferences can be made with this data about the role of executive time hori-
zons on domestic spending for AIDS programs (AIDS Spending).
The extracts I underlined relate to the author’s predictions. Her summary not only de-
scribes how her hypotheses did against the data, but also describes the parts of her re-
search that did not yield any significant results. Note that the table summary interprets
parts of the table that are not reproduced here.
13.5. Graphs
Section 9.1 describes how to add options to your graphs in order to make them more
readable. As a rule of thumb, include graphs only when they convey more information
than a well-formatted table.
Exported graphs should use either PNG or PDF format. Export can be performed in
more than one way, so read carefully:
The simplest way to produce a graph in Stata is to run a graph command and
then save the results of the ‘Graph’ window by using the Save item of the File
menu. Copying and pasting your graphs from Stata to your text directly is not
good enough, as it does not create the actual graph file and skips the part
where you can choose its format.
This method is alright if your graph command appears in your do-file, and if the
graph is stored in Stata memory along the way. To do so, add the name() op-
tion with the replace sub-option to the graph commands in your do-file that
produce a graph included in your paper:
* Histogram of the dependent variable, with saving options.
hist dv, freq normal name(histogram_dv, replace)
The safest way to export graphs is to include the graph export command in-
stead of copying and pasting, as to create graphs on the fly, when your do-file
is running. The example below illustrates the name(, replace) and graph export
commands:
* Histogram of the dependent variable, with saving options.
hist dv, freq normal name(histogram_dv, replace)
* Exporting histogram_dv.
gr export histogram_dv.png, replace
Basic formatting rules then apply:
Be parsimonious. Your do-file will produce more graphs than you will end up
including in your final paper: most of these graphs are used for visual explora-
tion, but need not be included in the final stage of analysis. Section 16.2 re-
states that point.
Explain your graph. Do not consider your job done until the graph has a title,
possibly a caption (as with the footnotes in a table), and a clear reference point
in your text that cites your graph as either “Figure 1” or “Fig. 1” (use any, con-
sistently), and explains your graphical results.
14. Assignment No. 1
Section almost entirely consistent with the course in its current form. Use the example
and template files as well as the course slides for more exhaustive instructions.
Please make sure that you read this instruction sheet in full. The grade for this As-
signment will refer to every instruction to assess your capacity to work in a quantitative
environment, including both stylistic and substantive issues. Also read Section 13 on
formatting your work before starting this Assignment.
Why work hard, if at all, on your Assignment?
Take it as a challenge. Unlike jazz music, quantitative data are very much inert
by nature and will require that you put some life into them. Explore, tabulate,
describe it… Take possession of your data!
Something important is happening. Quantitative data are currently affecting
both our vision of society and the core tenets of social science. Join the scientific
revolution!
Your work is cumulative. You will be able to use this Assignment to write up
your final paper, which underlines the absolute need for regular work and prac-
tice during this course. Believe me: other methods will fail.
Welcome to the world of quantitative methods, and good luck!
14.1. Research design
Start by describing your research question and hypotheses, using approximately 15
lines of text. While this step would require consulting the scientific literature on your
research topic as to derive your hypotheses (or predictions) from the findings of previ-
ous studies, you do not have to produce a literature review for this course, and should
instead use your acquired knowledge of the topic as well as your intuitive predictions.
Here are a few examples of potential topics, in addition to those sent by email or men-
tioned in class:
Differences across time and/or countries in attitudes toward: religion, inequali-
ty, homosexuality, marriage, immigration… The European Values Survey and
the World Values Survey hold a large sample of questions on these themes and
on many others.
Use your general knowledge and curiosity to come up with questions. Who are
the individuals who declare being optimistic about their future? How does in-
come and health influence other aspects of wellbeing? Is public opinion split on
topics such as climate change or euthanasia?
Changes in political factors such as party identification and left/right political
cleavages, political regimes, voting systems… The political science literature
holds virtually thousands of ideas in that domain. Start by checking the ICPSR
data repository on such topics.
Remember that polling data is produced for many political events and policies.
Who supported the invasion of Iraq, and who would support military interven-
tion in Iran to stop nuclear proliferation? Is there public support for the use of
torture when interrogating terrorists?
Social determinants of income inequality, poverty and crime rates, health, edu-
cational attainment and so on, at a geographic scale for which data are availa-
ble. The European Social Survey documents some of these aspects for Europe,
but there are thousands of datasets available.
As an example, think of how some issues are unequally distributed among age
groups and between men and women, such as drug abuse, career advance-
ment, homicide, geographic mobility, traffic injuries, unauthorised digital file-
sharing, depression, alcohol consumption, etc.
Observed effects, both positive and negative, of entry in the European Union or
transition to democracy, or… on demographics, life expectancy, economic per-
formance, crime rates, green technology, social mobility… Evaluations and ex-
periments are conducted on all sorts of events.
To take a few examples, fascinating academic studies have documented the ef-
fects of the Cultural Revolution on mental stress and cancer rates in China, on
the effects of class size on educational performance, or on the benefits and
costs of public-private partnerships.
Be imaginative! For example, Jane Austen wrote in Pride and Prejudice (1813):
“It is a truth universally acknowledged, that a single man in possession of a
good fortune, must be in want of a wife.” To what extent was Jane Austen
right? (Thomas Hobbes is also a great source of research questions, but they
tend to be much more pessimistic about human nature than Jane Austen was.)
Again, be imaginative: your personal interest and ideas are crucial to the task. Data are
collected and made available at all levels of society, from municipal districts in urban
studies to the whole planet in international relations, on all sorts of topics. Try to review
a maximum of options, but know your limits: some (interesting) questions are out of
our range of skills for this course. Do not use time series data, and select a continuous
or ordinal dependent variable.
A real-life example of model description goes like this. Take a close look at the scientific
style of the description, especially when it comes to the description of predictions and
variables, since you will be borrowing from that style of writing in your own Assign-
ment:
The dependent variable in the model is the size of government. The concept is
measured, following the example of several previous studies (e.g., Alesina &
Perotti, 1999; Poterba & von Hagen, 1999), by total government outlays as a
percentage of GDP. The measure has considerable face value. However, to test
the robustness of the findings, a second model will be estimated using total rev-
enue of all levels of government as a percentage of GDP (Cameron, 1978; Hu-
ber et al., 1993) as the dependent variable.
However, there are also several control variables that need to be included in
the study. It has been argued that the size of the public economy of a country is
determined by its economic openness (see Alesina & Wacziarg, 1998; Cameron,
1978; Rogowski, 1987). A similar statement has also been made in reference to
the size of the welfare state (Katzenstein, 1985). The logic behind the argument
is the following: If more open countries are more vulnerable to exogenous
shocks such as shifts in their terms of trade with world markets and if govern-
ment spending is capable of stabilizing income and consumption, then more
open countries will need a larger government to play a stabilizing role. The eco-
nomic openness control variable is measured by the ratio of imports and exports
to the GDP. Institutional models of the size of the public economy have also
stressed the impact of the federal, institutional structure of government (Cam-
eron, 1978; Schmidt, 1996)… Therefore, one might expect nations with a fed-
eral structure of government to have larger public economies than countries
with a unitary structure. Linked to this explanation is another aspect of the insti-
tutional structure of government: the degree of fiscal decentralization (Camer-
on, 1978). In accord with the previous statement, the relatively decentralized
nations should have a larger scope of public economy… In any event, despite
the confusion about the direction of the association, federalism has been identi-
fied as an important explanatory variable for government size. Lijphart (1999)
created an index measure of federalism and fiscal decentralization ranging from
1 to 5. This measure will be included in the analysis as a control for the effects
of government structure.
Source: Margit Tavits, “The Size of Government in Majoritarian and Consensus Democracies”, Comparative Political Studies 37(3): 340–359, 2004 (footnotes removed, highlighted text added).
You should have noticed that the hypotheses are often directional predictions, written either as positive relationships, such as “if x increases, y is also likely to increase, and conversely”, or as negative relationships, as in “x and y are expected to be inversely related, with y being likely to decrease when x increases, and conversely”. The course slides offer advice on hypothesis writing.
14.2. Dataset description
Add to your Assignment a description of the study and dataset that you will be using to address your research question. For example, the sampling strategy and variable description for the NHIS data that we use to inspect the Body Mass Index of U.S. residents would read as follows:
The National Health Interview Survey (NHIS) is a multipurpose health survey
conducted in the United States by the National Center for Health Statistics,
which is part of the Centers for Disease Control and Prevention (CDC).1 It uses
a multi-stage probability sample that includes stratification, clustering and over-
sampling of racial/ethnic minority groups; it forms a representative sample of ci-
vilian, non-institutionalized populations living in the United States, and its total
sample population is composed of 251,589 individuals from the 2000–2009 survey years; this was reduced to N = 24,291 by subsetting the data to the 2009 survey year.
The dependent variable will be the Body Mass Index (BMI) of the respondents,
which we constructed from available measures for weight and height (variable
bmi). We also recoded the BMI variable into its seven official categories, which
range from severely underweight to morbidly obese (variable bmi7).
The independent variables are sex, age, education level, health status, physical
exercise (vigorous and leisurely activity), race and health insurance status. We
expect to find higher average levels of BMI for males, for older and less educat-
ed people, as well as for racial groups that are either less educated or less likely
to be covered by health insurance, which is why we included race as a control
variable. We expect BMI to decrease with wealth, education and frequent phys-
ical activity.
1 Source URL: http://www.cdc.gov/nchs/nhis.htm.
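To illustrate how such variables might be constructed in a do-file, here is a minimal sketch; the variable names (weight, height), the metric units and the cut-off points are assumptions made for the example, not the actual NHIS coding:

* Hypothetical sketch: compute BMI from weight (in kilograms) and height (in metres)
gen bmi = weight / (height^2)
* Recode BMI into seven ordered categories, using illustrative cut-off points
gen bmi7 = irecode(bmi, 16, 18.5, 25, 30, 35, 40)
la def bmi7 0 "Severely underweight" 1 "Underweight" 2 "Normal" ///
    3 "Overweight" 4 "Obese" 5 "Severely obese" 6 "Morbidly obese"
la val bmi7 bmi7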
As mentioned in class, data discovery is a skill in itself, and a time-consuming one at that. Identifying and fully describing a dataset is never a matter of a few minutes: you will need at least a couple of hours to locate, download and explore a selection of datasets in order to make your final choice.
If you have absolutely no previous experience with quantitative analysis, you should
begin with the European Social Survey (ESS) if you are interested in measuring social
attitudes. If you are more interested in country-level data on political and economic
topics, the recommended dataset is the Quality of Government (QOG) dataset.
Include some critical perspective on the data. Your data are limited in precision, due to
issues of measurement: for instance, gross domestic product (GDP) is a measurement
(or proxy) of national wealth that does not reflect inequalities in income concentration.
Briefly document any issues with variables and variable measurement in a few lines, af-
ter reading from the codebook.
Check how your dataset is constructed. Typically, your dataset should hold homogeneous units of observation in rows and variables in columns. It should also contain only one year of cross-sectional data. Use the subsetting procedures described in Section 7 if your dataset contains data over several years.
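For instance, a minimal subsetting sketch, assuming a hypothetical dataset file and a year variable, would look like this:

* Load the full dataset, then keep only the most recent cross-section
use "mydata.dta", clear
keep if year == 2009
* Check how many observations remain
count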
Your description should cover the sampling strategy used by the survey, its unit of analysis and the size of the sample (its total number of observations). Additionally, you should provide the full source for your data (authors, URL…). All in all, your dataset description should not run over 10 lines of text. Writing these lines will require that you spend some time reading the documentation that comes with your dataset.
A real-life example of dataset description goes like this. Again, take a close look at the
scientific style of the description:
To investigate the sources of ethnic identification in Africa, we employ data col-
lected in rounds 1, 1.5, and 2 of the Afrobarometer, a multicountry survey pro-
ject that employs standardized questionnaires to probe citizens’ attitudes in new
African democracies. The surveys we employ were administered between 1999
and 2004. Nationally representative samples were drawn through a multistage
stratified, clustered sampling procedure, with sample sizes sufficient to yield a
margin of sampling error of ±3 percentage points at the 95% confidence level.
Our data consist of 35,505 responses from 22 separate survey rounds conduct-
ed in 10 countries: Botswana, Malawi, Mali, Namibia, Nigeria, South Africa,
Tanzania, Uganda, Zambia, and Zimbabwe.
Source: Benn Eifert, Edward Miguel and Daniel N. Posner, “Political Competition and Ethnic Identification in Africa”, American Journal of Political Science 54(2): 494–510, 2010 (footnotes removed, highlighted text added).
14.3. Variable description
Before starting your variable description, make sure that your dataset holds at least 30
valid (non-missing) observations for your independent and dependent variables, other-
wise you will need to identify other variables to run a statistically robust analysis. Use
the su and fre commands to run these checks.
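A minimal sketch of these checks, using hypothetical variable names (dv for the dependent variable, iv1 for an independent variable):

* Summary statistics for a continuous variable (the Obs column gives the valid N)
su dv
* Frequencies, including missing values, for a categorical variable
fre iv1
* Count observations that are non-missing on both variables at once
count if !mi(dv, iv1)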
Start by describing your dependent variable (the variable that your research will aim to explain) in detail. Your dependent variable must be either continuous or ordinal (or binary if you can handle a bit more theory in later sessions). Check the codebook of the study for the exact wording of the question if it comes from a social survey, or check the definition of the indicator if it comes from a country-level study. If the variable is measured on an ordinal scale, describe the range of possible values.
Always check the precise coding of your variables. For example, if you have selected an ordinal variable that is coded in reverse order, such as 1 for “Best” and 5 for “Worst”, you can install and use the revrs command to reverse it into an intuitive scale. Also make sure that you know how missing values are coded for each of your variables (the native Stata coding is .).
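For example, the following sketch reverses a scale and recodes survey missing codes; the variable name and the missing codes (77, 88, 99) are assumptions to be checked against your own codebook:

* Install the user-written revrs command once, then reverse the scale
* (see help revrs for its exact behaviour and options)
ssc install revrs
revrs myscale
* Recode survey-specific missing codes to Stata's native missing value (.)
mvdecode myscale, mv(77 88 99)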
A real-life example of dependent variable description goes like this. Again, take a close
look at the scientific style of the description:
The main dependent variable we employ comes from a standard question de-
signed to gauge the salience for respondents of different group identifications.
The question wording [is] as follows:
“We have spoken to many [people in this country, country X] and they
have all described themselves in different ways. Some people describe
themselves in terms of their language, religion, race, and others describe
themselves in economic terms, such as working class, middle class, or a
farmer. Besides being [a citizen of X], which specific group do you feel
you belong to first and foremost?”
As noted, a major advantage of the way this question was constructed is that it
allows multiple answers and thus permits us to isolate the factors that are asso-
ciated with attachments to different dimensions of social identity. We group re-
spondents’ answers into five categories: ethnic, religion, class/occupation, gen-
der, and “other.”
Note: in this example (taken from the same source as above), the dependent variable is a nominal variable, which is strictly categorical and not continuous. For this course, you should not use nominal variables, but rather focus on continuous or ordinal variables (which we will treat as continuous).
Continue the description of your continuous dependent variable by describing its distribution and, if necessary, transforming it towards a more normal distribution, using the procedures shown in class and covered in Section 9.
Finish by briefly describing your independent variables, using tables of summary statis-
tics (see Section 9). No graphs should be required for these variables.
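A brief sketch of these steps, again with hypothetical variable names (dv for the dependent variable, iv1 and iv2 for continuous independent variables):

* Inspect the distribution of the dependent variable
hist dv, percent normal
* Explore ladder-of-powers transformations towards normality
gladder dv
* If a log transformation looks appropriate (and dv is positive), create it
gen log_dv = ln(dv)
* Compact table of summary statistics for the independent variables
tabstat iv1 iv2, s(n mean sd min max) columns(statistics)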
14.4. Programming
Your first Assignment must come with a do-file. You will have to write the do-file and
then run it (execute it) to produce a log file, as described in Section 3 and Section 3.6 in
particular.
Your do-file should not contain any errors: it should run until its end without stopping
(breaking) because of a mistake in your code. This will require that you test your do-file
multiple times to debug it (correct mistakes).
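A minimal do-file skeleton along those lines, with hypothetical file names, might read:

* Record all output to a log file
log using assignment1.log, replace
* Load the dataset
use "mydata.dta", clear
* ... data preparation and analysis commands go here ...
* Close the log at the very end of the do-file
log close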
14.5. Reminders
Do not panic. Work regularly, and you should be fine. It is foreseeable that some aspects of statistical analysis, quantitative methods or Stata procedures will leave you lost at some point, especially if you are learning about these topics for the first time. If you work every week along the basic course schedule described in Section 1, you will be fine.
Always read the replication sets from each course session, as provided on the
course website. The course do-files show you every single procedure that you
might need to use for your own research, which means that the answers to your
problems are probably just a few clicks away from where you are standing right
now.
Get the formats right. Your email should contain all your files, and should be sent following the instructions mentioned in Section 13. Following these instructions really is a standalone skill. Basic psychology teaches that grading instructors like it a lot when students follow all instructions, and the reverse statement is also likely to be true.
Again, good luck, and see you soon!
15. Assignment No. 2
Section almost entirely consistent with the course in its current form. Use the example
and template files as well as the course slides for more exhaustive instructions.
Please make sure that you read this instruction sheet in full. The grade for this As-
signment will refer to every instruction in order to assess your research skills in a quanti-
tative environment, including both stylistic and substantive issues. Also read Section 13
on formatting your work before starting this Assignment.
15.1. Corrections
Assignment No. 1 was composed of a text file and a do-file, and was the first draft that you submitted towards your final paper. Assignment No. 2 follows the same logic and is as much about extending your analysis as it is about improving your first draft. Assignments in this course are not standalone test grades: they are cumulative pieces of writing that monitor your progress on your project.
Assignment No. 2 builds on your previous Assignment and will bring you just one step away from writing up your final paper. It is therefore essential that you fully revise Assignment No. 1 before ‘upgrading’ it to Assignment No. 2. Please refer to Section 14 and to the feedback on your Assignment to make sure that you have done so.
Here are some common mistakes that often appear in early do-files, and which you
should immediately correct:
Analysing ESS data without survey weights. If your research design relies on
data from the European Social Survey, you will need to weight the data as indi-
cated in its documentation.
Simply put, you should insert the svyset [pw=wgt] command immediately after
loading your data with the use command. If your research design covers all Eu-
ropean countries, then you need to generate a product of design and popula-
tion weights to weight respondents properly: insert gen wgt=dweight*pweight
before the svyset command.
If you are working on only one country, or if population (country) weights are
irrelevant to your research design, simply replace wgt by dweight in the svyset
command above and ignore the gen command.
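Putting those lines together, the top of an ESS do-file might look like this sketch (the dataset file name is hypothetical):

* Load the ESS data
use "ess2008.dta", clear
* Combine design and population weights for a multi-country analysis
gen wgt = dweight*pweight
* Declare the survey design so that weighted estimates use wgt
svyset [pw=wgt]
* (For a single-country design, use: svyset [pw=dweight] and skip the gen line)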
Using fre for continuous variables when you should be using su instead. The most common mistake at the level of descriptive statistics is to use the fre command when the summarize (su) command is appropriate.
If you are using fre to count valid observations and missing data, you should be using the codebook command with the c option instead, which also gives you summary statistics and variable labels, as in this example:

. codebook agea gndr, c

Variable    Obs  Unique      Mean  Min  Max  Label
agea      50996      87  47.57369   15  123  Age of respondent, calculated
gndr      51123       2  1.541518    1    2  Gender

Note that in this example, the mean value of the gndr variable is irrelevant, as it was computed from arbitrary values assigned to gender in this categorical variable.
If you are facing more serious problems with missing data than just a few observations to drop, you should go back to recoding your missing values, as explained in Section 8.2.
Finally, do not forget about the count command and the if mi() or if !mi() logi-
cal operators. Those often come in handy when you are thinking about select-
ing variables for crosstabulation or association tests.
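For example, with hypothetical variable names:

* How many observations are complete on the variables of a planned crosstab?
count if !mi(dv, iv1, iv2)
* How many observations are missing on the dependent variable?
count if mi(dv)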
Assignment No. 1 should be close to three pages. Assignment No. 2 will add to
them, but before starting, you need to check whether your data and variable
descriptions are concise enough. Figures and tables might have pushed you
slightly over, up to four pages or even five pages if you chose a large font or
have a long table of summary statistics. But there is hardly any reason to push
Assignment No. 1 over five pages.
Limit figures and tables in Assignment No. 1 by using a single table for sum-
mary statistics, and a single graph for the histogram of your dependent varia-
ble: you are not expected to include other graphs like variable transformations
from the gladder command at the level of descriptive statistics. Assignment No.
2 will add more figures and tables.
Copy-pasted Stata output does not count as text. Your text should not include
Stata commands, and your tables should not be Stata output copied and pasted
in your text. Section 13.4 explains how to format summary tables, and the
tabout, tabstat and mat2txt commands shown in the template do-file for As-
signment No. 1 will export your results for formatting with Microsoft Office.
The following paragraphs summarize what your corrected version of Assign-
ment No. 1 should be telling its reader(s):
1. Introduction: (i) one or two paragraphs to state clearly the research question,
explain its relevance to the general public and with respect to previous literature
on the subject; (ii) one paragraph or two to formulate your argument in terms
of clear and testable hypotheses along with an explanation/justification for
what you predict.
2. Data: one paragraph that describes the dataset used in your study, describes
and justifies your choice of countries to compare, and mentions the final sample
size after deletion of missing cases.
3. Variables: (i) one paragraph that describes your dependent variable in terms of its summary statistics (mean, standard deviation, median, minimum, maximum) and its distribution, shown by a histogram; (ii) one paragraph or two to cite and explain the relevance to your research question and hypotheses of your choice of independent variables, along with a description of their distribution using either proportions (if the variables are categorical: binary, nominal or ordinal) or summary statistics (if the variables are truly or sufficiently pseudo-continuous).
4. Analysis: one paragraph for every separate association between your DV and
each of your IVs. This section is developed in Assignment No. 2.
5. Conclusion: one or two paragraphs that summarize your results with regard
to your general argument and research question.
15.2. Association
In this Assignment, you test for independence between your dependent variable and
each of your independent variables. The objective is to identify which independent var-
iables are worth keeping for the final regression analysis. The rule of thumb is to keep
only variables that have a statistically significant association with your dependent varia-
ble.
To that end, you need to review course material on bivariate tests: replicate each ses-
sion using the do-files, read through the course handbook chapters and slides, and read
Section 10 as well as various parts of Section 11 for correlation and simple linear regres-
sion.
As you will see, there are different types of tests of bivariate association: Chi-squared tests, t-tests, correlation and simple linear regression. The choice of the right test depends on the types of the variables involved. There are three basic cases (a command sketch follows the lists below):
When both variables are categorical, use the Chi-squared test. Produce a table
with column or row percentages, comment on the relationship between the var-
iables, and interpret the statistical significance (p-value) of the Chi-squared sta-
tistic.
When one variable is continuous and the other is categorical, use the t-test. For that purpose, the categorical variable needs to be recoded into a binary (0/1) variable in order to compare the mean value of the continuous variable in each group. Comment on the differences in means by interpreting the p-value for the null and each directional hypothesis.
When both variables are continuous, use a simple linear regression. After look-
ing at the number of observations used in the model, interpret the F-statistic, its
p-value and the R-squared. This first step establishes how much statistical pow-
er and fit your model provides.
When you are done understanding the overall fit of your model, turn to the in-
dependent variable: identify significant coefficients by looking at standard er-
rors, t-values and p-values, and read their direction and magnitude, as covered
in Section 11.
Note the following special cases:
The Chi-squared test requires a minimum number of observations per cell: if your crosstabulation shows a table where some cells fall below 5 observations, you should use Fisher’s exact test instead, as you should with ‘2 x 2’ contingency tables, regardless of the cell counts. Fisher’s exact test will be more accurate in both cases. Its test statistic is read as if it were a p-value.
Interval or ordinal variables offer you a choice of strategies: you can treat them either as categorical or as continuous data. However, if you decide to treat them as continuous and wish to measure the association of that variable with another continuous variable, you first need to make sure that there is a linear relationship between the two variables. To check this, display the relationship using a scatter plot. If it shows an approximately linear relationship, you can then use a simple linear regression.
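The following sketch summarises the commands involved in these cases; all variable names (catvar1, catvar2, bin for a 0/1 variable, dv and contvar for continuous measures) are hypothetical:

* Two categorical variables: crosstab with column percentages and Chi-squared test
tab catvar1 catvar2, col chi2
* Small expected cell counts or a 2 x 2 table: Fisher's exact test
tab catvar1 catvar2, col exact
* Continuous DV and binary IV: compare group means with a t-test
ttest dv, by(bin)
* Two continuous variables: check linearity with a scatter plot, then regress
sc dv contvar
reg dv contvar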
Interpret, usually in two or three sentences at most, each of the tables where you de-
tected a statistically significant relationship between two variables. Report relevant sta-
tistics in brackets within your sentences, such as the p-value for a Chi-squared test,
Fisher’s exact statistic (which reads as a p-value itself), or the p-value of a t-test or a
proportions test. When reporting correlations, report Pearson’s r and its p-value; do not
forget to report the intensity and direction of the correlation. If you are dealing with
several statistically significant correlations within your choice of variables, use a correla-
tion matrix to present them.
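A correlation matrix with significance levels can be produced as in this sketch (variable names are hypothetical):

* Pairwise correlations with p-values, starred when significant at the 5% level
pwcorr dv iv1 iv2, sig star(.05)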
In your do-file, you have to test all your independent variables, but you do not have to produce either tables or graphs in your Assignment for cases where the null hypothesis was retained (when no association can be identified at your alpha level of significance). Similarly, you should be selective when testing for interactions between independent variables: produce these tests only if there is a substantive justification to do so, as when you control for age while testing the association between an explanatory variable like household income and a dependent variable like the number of children in the household.
Remember to discuss the statistical significance and substantive importance of all bi-
variate associations. If you are comparing the behaviour of your DV across countries,
regions or socio-demographic groups, then you should focus here on discussing differ-
ences across those groups. Use as many separate tables as needed for crosstabulation
and, if useful, use figures to illustrate your results.
15.3. Regression
Assignment No. 2 also covers simple linear regression, which comes as a natural com-
plement to correlation. Section 11 covers both simple and multiple linear regression, but
you should limit yourself to simple linear regressions with only two variables at play for
this Assignment.
Unlike the methods we used while covering association and correlation, regression goes beyond merely detecting relationships between variables: it is a modelling technique that provides estimates of the model parameters. When reporting the regression coefficients and intercept, you should provide your own interpretation of that model, as illustrated in class.
Including a scatterplot showing the linear fit between your dependent and independent variables can serve to display your simple linear regression. The graph should not be used as a mere illustration: it refines the analysis by providing an informative visualization of the relationship between your variables, from which you can extract additional observations. Is the relationship truly linear, or is it curvilinear, for instance quadratic or exponential? Does the scatterplot reveal visually identifiable outliers, what observations do they stand for, and can you explain why they deviate so much from the linear fit? (A classical example is Luxembourg, which always stands out as soon as GDP per capita is involved as an independent or dependent variable, because its residents are outstandingly wealthy in comparison to virtually any other country.)
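A sketch of such a graph, again using hypothetical variable names (cname stands for a country name variable used to label potential outliers):

* Scatterplot of the two variables with the fitted regression line overlaid
tw (sc dv iv) (lfit dv iv)
* Same graph with marker labels, useful to spot outliers such as Luxembourg
tw (sc dv iv, mlab(cname)) (lfit dv iv)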
15.4. Reminders
Here are a few reminders as to how you should be organising your work. Most of it will
sound like old news to many of you.
Replicate, replicate, replicate the course sessions. To minimize the risk of using
the wrong command, test or graph, the easiest thing is to replicate the last
course session every week. The course website provides every single file used in
class, so that you can run the do-files again.
Your do-file should be replicable. When grading, we must be able to run your
analysis again, by running (executing) the do-file. This implies that you send a
do-file that contains no errors, along with your original dataset, just as the
course website provides both so that you can replicate course sessions at home.
Write as if you are writing a research paper. Your Assignment is the draft for
your final paper: it should be written as an ‘advance copy’ of it, with correct
English, full sentences, and clear explanations about what you are doing, what
you are finding, and what you think about it.
If you follow these instructions, bits and pieces of your paper should read as this com-
pletely fictional example, which illustrates how to translate a battery of tests into tables,
figures and, most importantly, interpretations:
The association between income and age groups, which reported a statistically
significant relationship with the Chi-squared test (p < 0.01), was also observable
when using continuous measures of both variables. Specifically, the correlation
matrix (Table 1) reports a strong, positive correlation (r = 0.45, p < 0.05) that
confirms their interplay within the sample population. Furthermore, as shown in
Figure 1, their relationship is quasi-linear, except at the highest values of both, where the linear fit is slightly less faithful to the data. The regression coefficient can be understood as follows: from the age of 25 onwards (the minimum age of the respondents in our sample), income, starting at approximately $22,000 per year, increases quasi-linearly by $700 each year on average, which reflects the effect of career advancement and enhanced employment opportunities on wages, as well as other factors that might have to do with capital accumulation.
When you are done with this Assignment, the last step towards your final paper will
consist in building a multiple regression model and a final interpretation of your re-
search question, all in the form of a standard research paper. Good luck, and see you
soon!
16. Final paper
Section still in draft form, but a good restructuring of all assignment sections will help
to make clear what should be mentioned here.
Your final paper is the finish line, the ultimate point, the very last episode of that epic
quest of yours. Fortunately, there is no dragon to beat. However, there is a paper to
write: if your last draft does not already read like a draft paper, you might have several
hours of work ahead.
In many ways, you have already cut several heads off the Stata hydra: by finding and preparing your dataset, by producing descriptive statistics, by running association tests and by thinking, again and again, about your research design. Finally, the last step, which has definitely revealed the worth of your work, has consisted in running a linear regression model to assess the predictive value of your independent variables over your dependent variable.
Your final paper is primarily a reorganization of your work, which means that you will, again, be revising previous assignments in order to remove any potential ambiguity in your wording or ideas, add what you might have omitted on first submission, and correct any mistakes that were flagged so far.
Rewriting will represent roughly 75% of your work on your final paper (increase that estimate if your previous assignments came back with a grade below 4 and/or lots of instructions for revision). The remaining 25% consists of checking your do-file carefully, reorganising it, producing a log file and sending them all by email, as with your drafts.
16.1. Structure
Your paper follows a scientific style of writing as well as a scientific breakdown of sections. Read this section even if you have some experience with writing under these conventions; if you do not, read it with extra care, as it will prove useful not only for this course but also for many others.
First, your paper needs to be written in scientific style, which means that the writing
will be as simple as possible and only as complex as necessary. Some additional guide-
lines apply:
Because the paper reflects what you know and did, do not use any term or ar-
gument that you cannot explain yourself. For instance, your paper will not ref-
erence “a sensitivity analysis of cluster sampling over high-resolution data”.
Without exception, reference every single item of your paper that you did not
create yourself. This applies to arguments, observations, and also to data: your
dataset needs to be fully referenced. The online source of your dataset will usu-
ally give an example citation for it.
Just as with any academic work, your paper is expected to have been carefully proofread, up to the point where the reader should not detect more than the occasional spelling mistake on every page or so. Spell-check your paper and check your sources for names, acronyms, etc.
Second, your paper should follow an outline analogous to the ‘IMRAD’ model, which
can be adapted to this course as follows (note that the example extracts are fictional
and do not represent any real study; real examples are provided later on in this instruc-
tion sheet).
Your Introduction spells out your research question, outlines the variables of in-
terest, and offers your hypotheses (from Assignment No. 1).
e.g. “I study the relationship between extreme-right voting and socioeconomic
status (SES), as measured through occupation, income and educational attain-
ment. I also control for age, gender, ethnic origin and religious beliefs.
My hypothesis states that extreme-right voters sit at the bottom of social hierarchies within their age and gender groups, and will therefore score lower on all measurements than other members of the social categories to which they belong.”
Your Methods cover your data and variables. The actual method of your paper
is ordinary least squares, or OLS, multiple linear regression.
e.g. “The study uses the last edition of the British Election Study, a survey con-
ducted by […] in May 2010. The dataset, which is available at […], contains
[…] adult respondents. The data were collected through face-to-face interviews
and the method of sampling was […].
The data were searched for significant correlations, which were then explored
through simple and multiple regression analysis in order to identify linear rela-
tionships between the variables of interest.”
Your Results report your independence tests (from Assignment No. 2) and the
results of your linear regression model.
e.g. “Extreme-right voting does not concern a majority of the population: as
Figure 1 shows, only a small fraction of British voters declared voting for any of
the extreme right political parties.
After observing a significant correlation between […] and […], we can state
with confidence that extreme-right voting is higher in lower income groups.
Figure 2 below plots this relationship for each gender group.
In parallel, our regression of political participation against income also shows
that lower income groups participate significantly less in elections. The results of
that regression are reported in Table 1. The high R-squared (.56) suggests that
income is a major factor at play here.”
Your Discussion concludes on your project, and includes criticism of both your
data and your predictions.
e.g. “Although the project succeeded in showing that income and educational attainment are predictors of extreme-right voting, the weak association suggests that other important factors come into play in explaining this relationship.
Furthermore, our hypothesis that religious behaviour would have a significant
impact on voting was not confirmed by our analysis. The small sample size for
our independent variables measuring religiosity limited the significance of our
tests and constitutes an important drawback of our study.”
16.2. Limits
Paper. Your research paper should fit on a maximum of 10–12 pages, using the standard format defined during our last course session. There is no length limit for the number of lines of code in your do-file, but anything outside of the 100–400 range will probably indicate something strange.
Graphs. Include only relevant graphs that help to understand the relationships
mentioned in your text. Choose the type of graph carefully, as explained in class
and in several sections of this guide.
The feedback on Assignment No. 2 will usually include some notes on which
graphs to include, but the simplest way to know whether or not to include a
graph in your final paper is to judge whether it brings anything of value to the
rest of your analysis; if the answer is anything but ‘absolutely yes’, do not in-
clude the graph.
Tables. Do not include tables except for your most significant outputs: summary
statistics, correlation matrix, and regression models. For other significance tests,
report the results (and especially the p-value) directly in your text. Presenting
and exporting tables is covered in Section 13.4.
16.3. Example
The following extracts are taken from a recent working paper published by the United
Nations Development Programme, which is used to illustrate what a research paper us-
ing quantitative data and methods should contain.
These are the opening lines of the text:
Introduction1
This paper examines the variation across countries and evolution over time of
life expectancy.
The opening section examines the impact of national income, measured as GDP
per capita in PPP, in Preston and augmented Preston regressions. Rather than
focus only on recent cross-sections since 1970 or so we use the available histori-
cal data going back to the beginning of the 20th century (the data are taken
from the series created for the GAPMINDER application and are described in the
data appendix). This long-run focus allows us to establish several basic facts
about the relationship.
1 Many thanks to comments from…
Source: Lant Pritchett and Martina Viarengo, “Explaining the Cross-National Time Series Variation in Life Expectancy: Income, Women’s Education, Shifts, and What Else?”, UNDP Research Paper 31, October 2010.
The authors immediately inform the reader about the dependent variable, life expectancy, and then introduce the first independent variable, income, and its proxy (the means of measuring it), which here is GDP per capita in PPP.
The method of analysis (some form of regression) is also mentioned, and the data are described by providing the source and the time period covered. It is good practice to store a detailed description of the data sources in an appendix, which the authors do (see pages 60–63 of their paper).
The first footnote acknowledges some colleagues for their help with writing the paper: if you have received help from anyone other than the course instructors, including other students from the class or people who you might have emailed about accessing your dataset, you should acknowledge their help.
The text continues as such:
First, there has been a strong cross-national relationship between income and
life expectancy for as far back as one can take the data. In the simple double
natural log Preston curve (life expectancy regressed on GDP per capita) the R-
squared for the 21 countries with data was as high as .8 as early as 1927 and
was at that level through the pre-World War II period. The modern data sets
with over 150 countries begin in 1952 and have availability every five years and
in that data there has been a high and rising R-squared roughly ever since (once
one controls for the AIDs affected countries).
This paragraph should be almost fully understandable now that you have completed the course. It mentions the sample size, the variables used by the authors in their hypothesis test (life expectancy and GDP per capita), as well as the shape of the relationship implied by this hypothesis (a natural logarithm).
The extract also shows that important aspects of your analysis should be mentioned directly in the text, like the R-squared or the control variables (here, the authors set aside some units of observation: countries with high rates of HIV/AIDS infection).
The only part that requires further explanation is the Preston curve, a classic finding by Samuel H. Preston: in cross-sectional data, life expectancy increases roughly as the natural logarithm of GDP per capita. Learn more on Wikipedia: http://en.wikipedia.org/wiki/Preston_curve.
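For reference, a Preston-style double-log regression could be sketched in Stata as follows; lifeexp and gdpc are hypothetical variable names for life expectancy and GDP per capita:

* Take natural logs of both variables
gen ln_le = ln(lifeexp)
gen ln_gdpc = ln(gdpc)
* Regress logged life expectancy on logged GDP per capita; inspect the R-squared
reg ln_le ln_gdpc
* Visual check of the (approximately linear) relationship between the logged variables
tw (sc ln_le ln_gdpc) (lfit ln_le ln_gdpc)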
16.4. Reminders
Normalise your files and emails. Files sent without normalisation at this stage
run the unacceptable risk of either delaying the grading process or (even worse)
getting lost in dozens of other emails. You have almost all done very well with
this, for which you earn infinite gratitude from virtually every grader in the
world; please do so one last time.
Unlike the deadlines for midterm assignments, the deadline for the final paper is absolutely firm: it corresponds to the last date by which grading can be performed in acceptable conditions before the formal submission of grades to the Sciences Po administrative units. Late work will therefore not be accepted. The deadline will appear in a class email along with additional guidance.
Good luck, and well done!
We wish you the best of luck in all your future endeavours. Please submit some feed-
back on the course, and let’s meet later on for drinks and/or food.
Datasets:
ESS European Social Survey (used in Sections 6 and 8)
4th wave (2008)
http://ess.nsd.uib.no/
QOG Quality of Government (used in Sections 9, 10 and 11)
Most recent update (6 April 2011)
http://qog.pol.gu.se/
NHIS National Health Interview Survey (used in Sections 7, 9 and 10)
Last survey year (2009)
http://www.cdc.gov/nchs/nhis/
Commands: (in progress…)
extremes, 82
mvencode, 63
Applied examples:
Application 5a. Weighting a cross-national survey (Data: ESS 2008) 34
Application 5b. Weighting a multistage probability sample (Data: NHIS 2009) 36
Application 5c. Reading frequencies (Data: ESS 2008) 39
Application 6a. Locating and renaming variables (Data: ESS 2008) 43
Application 6b. Counting observations (Data: ESS 2008) 45
Application 6c. Selecting observations (Data: ESS 2008) 45
Application 7a. Subsetting to cross-sectional data (Data: NHIS 2009) 53
Application 8a. Inspecting a categorical variable (Data: NHIS 2009) 57
Application 8b. Inspecting a continuous variable (Data: NHIS 2009) 58
Application 8c. Labelling a dummy variable (Data: NHIS 2009) 58
Application 8d. Recoding continuous data to groups (Data: NHIS 2009) 61
Application 8e. Recoding dummies (Data: NHIS 2009) 61
Application 8f. Recoding bands (Data: NHIS 2009) 63
Application 8g. Encoding strings (Data: MFSS 2006) 66
Example 9a. Visualizing continuous data (Data: NHIS 2009) 72
Example 9b. Kernel density plots (Data: NHIS 2009) 74
Example 9c. Visualizing categorical data (Data: ESS 2008) 74
Example 9d. Survey weights and confidence intervals (Data: ESS 2008) 76
Example 9e. Democratic satisfaction (Data: ESS 2008) 78
Example 9f. Normality of the Body Mass Index (Data: NHIS 2009) 82
Example 9g. Transforming the Body Mass Index (Data: NHIS 2009) 83
Example 9h. Inspecting outliers (Data: NHIS 2009) 85
Example 9i. Keeping or removing outliers (Data: QOG 2011) 85
Example 10a. Trust in the European Parliament 89!
Example 10b. Female leaders and political regimes 91!
Example 10c. Religiosity and military spending 93!
Example 10d. Religion and interest in politics 95!
Example 10e. Legal systems and judicial independence 97!
Example 10f. Party support (with controls) 105!
Example 11a. Foreign aid and corruption 110!
Example 11b. Trust in institutions 112!
Example 11c. Subjective happiness 115!
This version: 23 February 2012.
Written using Stata 11/12 SE on Mac OS X Lion.
Typeset in Linotype Syntax and Menlo.
I rest my head on 115
But miracles only happen on 34th, so I guess life is mean
And death is the median
And purgatory is the mode that we settle in
Cannibal Ox, “Iron Galaxy”