Analysis Guide

MSDS 6371-405 Analysis Guide
David Josephs
October 13, 2018

Contents

I Drawing Statistical Conclusions  8
1 Problem 1: Randomized Experiment vs Random Sample  9
2 Problem 2: Identifying Confounding Variables  10
3 Problem 3: Identifying a Scope of Inference  11
4 Problem 4: Visual comparison of population means and a permutation test  13
5 Unit 1 Lecture Slides  17

II Inferences Using the t-distribution  24
6 Problem 1: A one sample t test  25
  6.1 Complete Analysis  25
    Hypothesis definition  25
    Identification of a critical value and drawing a shaded t distribution  25
    Value of Test Statistic  26
    P value  27
    Assessment of the Hypothesis test  27
    Conclusion and scope of inference  27
    Some R code  27
7 Problem 2: Two sample one sided t test  28
  7.1 Permutation test  28
  7.2 Two sample T test, full analysis  29
    Hypothesis definition  30
    Critical value and distribution  30
    Calculation of the T statistic  31
    P value  31
    Hypothesis assessment  31
    Conclusion  31
    Incorrect calculations  32
  7.3 R code  33
8 Problem 3: Two sample two sided t test  34
  8.1 Full Analysis  34
    Hypothesis Definition  34
    Critical value and shaded distribution  34
    T statistic  35
    P value  36
    Hypothesis Assessment  36
    Conclusion and Scope of inference  36
9 Problem 4: Power  37
  9.1 Single power curve  37
  9.2 Multiple power curves  38
  9.3 Calculating change in N  39
10 Unit 2 Lecture Slides  40

III A Closer Look at Assumptions  50
11 Problem 1: Two Sample T test with assumptions  51
  11.1 Complete Analysis  51
    Assumption checking in SAS  51
    Assumption Checking in R  53
    Complete Analysis  55
12 Outliers and Logarithmic Transformations  59
13 Log Transformed Data  74
  13.1 Full Analysis  74
    Problem Statement  74
    Assumptions  74
    Hypothesis Testing  76
    Statement of Hypotheses  76
    Critical Value  76
    Calculation of the t statistic  77
    Calculation of the p-value  78
    Discussion of the Null Hypothesis  78
    Conclusion  78
14 Unit 3 Lecture Slides  79

IV Alternatives to the t tools  98
15 Problem 2: Logging problem  99
  15.1 Complete Rank-Sum Analysis Using SAS  99
    Problem Statement  99
    Assumptions  99
    Statement of the Hypothesis  99
    Calculation of the P-value  99
    Results of the Hypothesis Test  100
    Statistical Conclusion  100
    Scope of Inference  101
    Confirmation Using R  101
16 Problem 3: Welch’s Two Sample T-Test with Education Data  102
  16.1 Problem Statement and Assumptions  102
    Problem Statement  102
    Assumptions  102
  16.2 Complete Analysis Using SAS  103
    Statement of Hypotheses  103
    Critical t Value  104
    Calculation of the t Statistic  105
    Calculation of the p Value  105
    Results of Hypothesis Test  105
    Conclusion  105
    Scope of Inference  106
    Verification using R  106
    Preferences  106
17 Problem 4: Trauma and Metabolic Expenditure rank sum  107
  17.1 Hand-Written Calculations  107
  17.2 SAS verification  110
  17.3 Full Statistical Analysis  110
    Problem Statement  110
    Assumptions  110
    Hypothesis definitions  111
    Critical Value  111
    Calculation of the z statistic  112
    Calculation of the p value  112
    Discussion of the hypothesis  112
    Conclusion  112
18 Problem 5: Autism and Yoga signed rank  113
  18.1 Hand-Written Calculations  113
  18.2 Verification in SAS and R  115
    Verification in SAS  115
    Verification in R  115
  18.3 6 step Sign Rank test using SAS  115
    Statement of Hypothesis  115
    Critical Values  116
    Calculation of a Z statistic  117
    Calculation of a p value  117
    Assessment of hypothesis  117
    Conclusion  117
  18.4 Paired t test in SAS  117
    Statement of Hypothesis  117
    Critical Values  117
    Calculation of a t statistic  118
    Calculation of a P value  118
    Assessment of Hypothesis  118
    Conclusion  118
  18.5 Confirmation with R  119
  18.6 Complete Statistical Analysis  119
    Assumptions  119
    Statement of Hypothesis  120
    Critical Values  120
    Calculation of a t statistic  121
    Calculation of a P value  122
    Assessment of Hypothesis  122
    Conclusion  122
19 Sexy ranked permutation test  123
20 Unit 4 Lecture Slides  125

V ANOVA  136
21 Problem 1: Plots and Logged Data  137
  21.1 Plots and Transformations  137
    Raw Data Analysis  137
    Transformed Data Analysis  142
  21.2 Complete Analysis  145
    Problem Statement  145
    Assumptions  145
    Hypothesis Definition  145
    F Statistic  146
    P-value  146
    Hypothesis Assessment  146
    Conclusion  146
    Scope of Inference  147
  21.3 Extra Values  147
    Value of R²  147
    Mean Square Error and Degrees of Freedom  147
    ANOVA in R!  147
22 Extra Sum of Squares  149
  22.1 Building the Extra Sum of Squares ANOVA Table  149
  22.2 Complete Analysis  150
    Problem Statement  150
    Assumptions  150
    Hypothesis Definition  150
    F Statistic  151
    P-value  151
    Hypothesis Assessment  151
    Conclusion  151
    Scope of Inference  151
  22.3 Degrees of Freedom and Comparison to T-Test  151
23 Welch’s ANOVA  152
  23.1 Complete Analysis  152
    Problem Statement  152
    Assumptions  152
    Hypothesis Definition  152
    F Statistic  152
    P-value  153
    Hypothesis Assessment  153
    Conclusion  153
    Scope of Inference  153
24 Unit 5 Lecture Slides  154

VI Multiple comparisons and post hoc tests  169
25 Bonferroni CIs  170
26 Multiple Comparison  173
27 Tukey’s test and Dunnett’s test  177
  27.1 Assumptions  177
    Raw Data Analysis  177
    Transformed Data Analysis  180
28 Multiple samples  185
  28.1 ANOVA  185
    Problem Statement  185
    Assumptions  185
    Hypothesis Definition  185
    F Statistic  185
    P-value  186
    Hypothesis Assessment  186
    Conclusion  186
  28.2 Tukey’s test  186
    Dunnett’s Test  188
29 Unit 6 Lecture Slides  190

VII Workflow for testing hypotheses  205

List of Codes

4.1 Creating Paneled histograms in SAS  13
4.2 Producing histograms in R  14
4.3 Two Tailed permutation test in SAS, using manually input groups  15
4.4 Two Tailed permutation test in R, using manually input groups  16
6.1 One sample t test in R with manual data input  25
6.2 Critical value and two sided shaded t distribution using SAS  26
6.3 One sample t test in SAS  26
6.4 One sample t test in R  27
7.1 A one sided permutation test in SAS  29
7.2 One sided shaded t distribution in SAS and Critval  30
7.3 Two sample t test using SAS  31
7.4 Two sample t test in R  33
8.1 Two sided two sample t test in SAS  35
9.1 Proc power single with pooled variance  37
9.2 Producing several curves with proc power  38
11.1 Checking the assumptions of a t test in SAS  51
11.2 t test assumption checking in R, Q-Q plot  54
11.3 t test assumption checking in R, histogram  54
12.1 Automatically input permutation test in SAS  72
12.2 Outlier removal in SAS  73
13.1 Log transform in SAS  75
15.1 Exact rank sum test using SAS  100
15.2 Wilcoxon rank sum test using R  101
16.1 Welch’s t test  102
18.1 Signed Rank test in SAS  115
18.2 Paired T test in SAS  118
19.1 Handcrafted rank sum test  123
21.1 Scatterplot of Raw Data Using SAS  137
21.2 Boxplot of Raw Data Using SAS  138
21.3 Histogram of Raw Data Using SAS  139
21.4 Q-Q of Raw Data Using SAS  140
21.5 Logging of Raw Data Using SAS  142
21.6 Scatterplot of Logged Data Using SAS  142
21.7 Boxplot of Logged Data Using SAS  143
21.8 Histogram of Logged Data Using SAS  143
21.9 Q-Q of Logged Data Using SAS  144
21.10 ANOVA Test Using SAS  146
21.11 Comparison of distributions using SAS  146
21.12 ANOVA in R  148
22.1 Regrouping data using SAS  149
22.2 Secondary ANOVA using SAS  149
23.1 Welch’s ANOVA in SAS  152
25.1 Bonferroni in SAS  170
26.1 All the multiple comparisons in SAS  173
26.2 Multiple comparisons with R  176
28.1 Tukey’s test in SAS and R  187
28.2 Dunnett’s test  188

Part I

Drawing Statistical Conclusions


Chapter 1

Problem 1: Randomized Experiment vs
Random Sample
Question 1
What is the difference between a randomized experiment and a random sample? Under what type of study/sample
can a causal inference be made?

Answer to Question 1
A randomized experiment is one in which the experimental variable (the “treatment”) is applied to randomly assigned subjects. For example, in a study with 400 subjects and treatments A, B, and a control group, each subject would be randomly assigned to the control group, group A, or group B. This is done to eliminate confounding variables, as well as possible bias. In a random sample, subjects are randomly chosen from the population, so that the subjects of the study can be assumed to be representative of the population as a whole [1]. We can make causal inferences from a randomized experiment, but not from a random sample.
Score: 20/20. Explanation: This answer gets full marks because it covers all of the points made in the key; it defines both random sampling and randomization in the same manner as the key. However, in the future it should be less wordy.


Chapter 2

Problem 2: Identifying Confounding
Variables
Question 2
In 1936, the Literary Digest polled 1 out of every 4 Americans and concluded that Alfred Landon would win the
presidential election in a landon-slide. Of course, history turned out dramatically different (see http://historymatters.gmu.edu/d/5168/ for further details). The magazine combined three sampling sources: subscribers to its
magazine, phone number records, and automobile registration records. Comment on the desired population of
interest of the survey and what population the magazine actually drew from.

Answer To Question 2
The magazine had hoped to get a random sample of the voting population, one which would be representative of the entire voting population of the country as a whole. Instead, it polled only its own subscribers, phone number records, and automobile registration records. 1936 was the height of the Great Depression, which means that the average American was struggling to survive. Therefore, while this sampling technique had worked in the past, this time around the magazine ended up sampling only the wealthiest people, those who could afford phones, cars, and magazine subscriptions, and the results were not representative of the population. Without truly random sampling, “the statistical results only apply to [those] sampled,” and cannot be representative of the entire population [2]. Therefore, it is just chance that the polls worked in previous years.
Score: 10/10. Explanation: This answer gets full marks because it states that the poll wanted to cover all of the voters (5 points), and it identifies the actual group polled with some explanation (affluent people) (5 points).


Chapter 3

Problem 3: Identifying a Scope of Inference
Question 3
3. Suppose we have developed a new fertilizer that is supposed to help corn yields. This fertilizer is so potent that
a small vial of it sprayed over an entire field is a sufficient dose. We find that the new fertilizer results in an average
yield of 60 more bushels over the old fertilizer with a p-value of 0.0001. Write up a scope of inference under the
following study designs that generated this data.
1. We offer the new fertilizer at a discount to customers who have purchased the old fertilizer along with a survey
for them to fill out. Some farmers send in the survey after the growing season, reporting their crop yield. From
our records, we know which of these farmers used the new fertilizer and which used the old one.
2. When a customer makes an order, we randomly send them either the old or new fertilizer. At the end of the
season, some of the farmers send us a report of their yield. Again, from our records, we know which of these
farmers used the new fertilizer and which used the old.
3. When a customer makes an order, we randomly send them either the old or new fertilizer. At the end of the
season, we sub-select from the fertilizer orders and send a team out to count those farmers’ crop yields.
4. We offer the new fertilizer at a discount to customers who have purchased the old fertilizer. At the end of the
season, we sub-select from the fertilizer orders and send a team out to count those farmers’ crop yields. From
our records, we know which of these farmers used the new fertilizer and which used the old one.

Answer
1. We cannot make causal inferences or inferences about the population, as this was neither a randomized experiment nor a random sample. Available units from distinct groups were selected; however, the treatment was not assigned randomly, which may mean that only farmers who needed a change in fertilizer, or who were struggling and could not afford the old fertilizer, decided to go for the discount. The study is also only representative of those who submitted reports, as no random sampling was done.
Score: 8/8. Explanation: This answer gets full credit because it states that causal inferences cannot be made and that population inferences cannot be made, which agrees with the key.
2. We can make causal inferences but not inferences about the population. The treatment was applied at random to the subjects, but no random sampling was done. Therefore this study only speaks to the effect of the treatment on farmers who submitted reports, who may have had notably different yields.
Score: 8/8. Explanation: This answer receives full credit because it states that causal inferences can be made and that population statements cannot be made, with explanations, all agreeing with the key.
3. We can make causal inferences and inferences about the population. The farmers were randomly assigned
different treatments, which allows us to make causal inferences, and then the farmers were randomly selected
for the yield to be counted, which means that the selected farmers should be representative of the entire
population. With these experimental parameters, we can decide whether the new fertilizer worked better,
worse, or the same.

Score: 7/8. Explanation: This answer loses a point because the problem does not explicitly state that the
sub sample was random. I assumed it was a random sample, and with that assumption, the answer is entirely
correct, however the randomness is not explicitly stated. Therefore a point is taken away. The rest of the
answer agrees entirely with the key, therefore no more points will be lost
4. We can make inferences about the population but not causal inferences. The treatment was not supplied randomly, so perhaps only farmers who needed a discount, or for whom the old fertilizer wasn't working, chose the new fertilizer. However, the farmers were randomly sampled, which means we can make inferences about the population to some degree, but we definitely cannot make causal inferences.
Score: 7/8. Explanation: This answer loses a point because the problem does not explicitly state that the
sub sample was random. I assumed it was a random sample, and with that assumption, the answer is entirely
correct, however the randomness is not explicitly stated. Therefore a point is taken away. The rest of the
answer agrees entirely with the key, therefore no more points will be lost.


Chapter 4

Problem 4: Visual comparison of population
means and a permutation test
Question 4
4. A Business Stats class here at SMU was polled, and students were asked how much money (cash) they had in
their pockets at that very moment. The idea was to see if there was evidence that those in charge of the vending
machines should include the expensive bill / coin acceptor or if the machines should just have the credit card
reader. Also, a professor from Seattle University polled her class last year with the same question. Below are the
results of the polls. SMU 34, 1200, 23, 50, 60, 50, 0, 0, 30, 89, 0, 300, 400, 20, 10, 0 Seattle U 20, 10, 5, 0, 30, 50, 0,
100, 110, 0, 40, 10, 3, 0
1. Use SAS to make a histogram of the amount of money in a student’s pocket from each school. Does it appear
there is any difference in population means? What evidence do you have? Discuss your thoughts.
2. Use the following R code to reproduce your histograms. Simply cut and paste the histograms into your HW.
SMU = c(34, 1200, 23, 50, 60, 50, 0, 0, 30, 89, 0, 300, 400, 20, 10, 0) Seattle = c(20, 10, 5, 0, 30, 50, 0, 100, 110,
0, 40, 10, 3, 0) hist(SMU) hist(Seattle)
3. Run a permutation test to test if the mean amount of pocket cash from students at SMU is different than that
of students from Seattle University. Write up a statistical conclusion and scope of inference (similar to the one
from the PowerPoint). (This should include identifying the Ho and Ha as well as the p-value.)

Answer
1. Code (see Appendix 1) for the SAS histogram (Figure 1) was inspired by [3]. The code used to produce this
histogram is as follows:
Code 4.1. Creating Paneled histograms in SAS
proc sgpanel data=CashMoney;
panelby School / rows=2 layout=rowlattice;
histogram cash / binwidth = 25;
run;


Figure 4.0.1. Distribution of Cash by School, produced in SAS

It appears that the SMU sample has a slightly higher sample mean. However, I do not believe this means that the SMU population has a higher mean than Seattle U, as this was not a random sample; it was just of business students. The SMU cash distribution appears wider, with higher values, but again it is hard to tell whether that is indicative of the entire population. Based on where the majority of the distributions lie, I believe both populations would have similar means, with SMU slightly higher. SMU is a private school and Seattle U is one of the best value schools in the country, so it is possible that SMU students might have, in general, more money than students at Seattle U, and therefore more cash.
Score: 5/5. Explanation: This receives full marks, the histograms are correct and the conclusions are similar
to the key, and are very logical. The code is included in the appendix.
2. The code used to generate the R histograms (Figure 2) was given in the homework and is presented below
Code 4.2. Producing histograms in R
SMU = c(34, 1200, 23, 50, 60, 50, 0, 0, 30, 89, 0, 300, 400, 20, 10, 0)
Seattle = c(20, 10, 5, 0, 30, 50, 0, 100, 110, 0, 40, 10, 3, 0)
par(mfrow=c(1,2))
hist(SMU)
hist(Seattle)

Figure 4.0.2. Cash Distributions at SMU and Seattle U, Produced using R

The code used to generate the permutation test (Appendix 2), using SAS, is given in [4]. The results of the permutation test, with 999,999 permutations, can be seen in Figure 3. Below is SAS and R code for permutation tests:
Code 4.3. Two Tailed permutation test in SAS, using manually input groups

proc iml;
G1 = {/*SMU student data*/};
G2 = {/*Seattle U student data*/};
obsdiff = mean(G1) - mean(G2); /*difference in the means of the two data sets*/
print obsdiff;
call randseed(12345); /* set random number seed */
alldata = G1 // G2; /* stack data in a single vector */
N1 = nrow(G1); N = N1 + nrow(G2);
NRepl = 999999; /* number of permutations; I did ~1 million */
nulldist = j(NRepl,1); /* allocate vector to hold results */
do k = 1 to NRepl;
x = sample(alldata, N, "WOR"); /* permute the data */
nulldist[k] = mean(x[1:N1]) - mean(x[(N1+1):N]); /* difference of means */
end;
title "Histogram of Null Distribution";
refline = "refline " + char(obsdiff) + " / axis=x lineattrs=(color=red);"; /*build a reference line at the observed difference*/
call Histogram(nulldist) other=refline;
pval = (1 + sum(abs(nulldist) >= abs(obsdiff))) / (NRepl+1); print pval; /*calculate the two-sided p-value*/
/*https://blogs.sas.com/content/iml/2014/11/21/resampling-in-sas.html*/

Figure 4.0.3. Results of Permutation Tests

R code for the same test is shown below (Code 4.4). In this test, the null hypothesis is that there is no difference between the mean amount of cash in a student’s pocket in the two groups, while the alternative hypothesis is that there is a meaningful difference between the two [4]. The permutations were used to generate the null distribution of differences, and the red line shows where the observed difference lies. Further calculation shows that the p-value was 0.149, meaning about 15% of the null distribution is as extreme or more extreme than our observed difference [5]. At a 5% or even 10% significance level, we cannot reject the null hypothesis, and therefore we cannot say there is any difference between the two means; the SMU students and Seattle U students have more or less the same amount of cash in their pockets. As for scope of inference, this was not a randomized experiment or a random sample, and therefore we cannot make any causal inferences (there was no treatment applied, and we definitely cannot say going to SMU makes you have more or less money in your pocket than going to Seattle U), and we cannot make any inferences about the student bodies as a whole (population inferences). The sample is only representative of the students sampled, so we have very little scope of inference.
Code 4.4. Two Tailed permutation test in R, using manually input groups
school1 <- rep('SMU', 16)
school2 <- rep('Seattle', 14)
school <- as.factor(c(school1, school2))
all.money <- data.frame(name=school, money=c(SMU, Seattle))

t.test(money ~ name, data=all.money)
number_of_permutations <- 1000
xbarholder <- numeric(0)
counter <- 0
observed_diff <- mean(subset(all.money, name == "SMU")$money) - mean(subset(all.money, name == "Seattle")$money)

set.seed(123)
for(i in 1:number_of_permutations)
{
  scramble <- sample(all.money$money, 30)
  smu <- scramble[1:16]
  seattle <- scramble[17:30]
  diff <- mean(smu) - mean(seattle)
  xbarholder[i] <- diff
  if(abs(diff) > abs(observed_diff))
    counter <- counter + 1
}
hist(xbarholder, xlab='Permuted SMU - Seattle', main='Histogram of Permuted Mean Differences')
box()
pvalue <- counter / number_of_permutations
pvalue
observed_diff

Score: 15/15. Explanation: This receives full marks, 5 points for running the test, 5 points for the p value, and
5 points for mentioning the null and alternative hypotheses and getting the correct conclusion. The code is
included in the Appendix.


Chapter 5

Unit 1 Lecture Slides


MSDS 6371: Lecture 1
Drawing Statistical Conclusions: randomized experiments vs. observational studies; random samples vs. self-selection.

Symbols: [table of notation for the sample and population mean, standard deviation, and variance]

Creativity Scores: Intrinsic vs. Extrinsic Motivation. Subjects volunteered for the study. Then, treatments were randomly assigned.

Starting Salaries: Female vs. Male. Subjects were NOT randomly chosen by the researcher (all employees at a bank were included), and the group assignments were not random either. If a random sample of the employees had been used…

Causal Inference: Randomized vs. Observational Study
• Causal inferences can be drawn from randomized experiments.
• Causal inferences cannot be drawn from observational studies due to CONFOUNDING.
CONFOUNDING VARIABLE: related to both group membership and to the outcome.

Types of Studies
• Creativity Study: randomized experiment.
• Salary Study: observational study.

Example: Since 2000, the U.S. median wage
• has overall increased about 1%
• has decreased for high school (or below) dropouts and high school graduates (no college)
• Is this a paradox? No, more people are going to college.

What are some possible confounding variables in the gender/salary study? In the starting salaries study, maybe males have
• more education
• more seniority
• more age (older)
• more willingness to negotiate starting salary

[Figure: ages in the two salary groups, marked “o” (older) and “y” (younger).] In a randomized experiment, variables like age are also randomly distributed to each group, removing the confounding effect.

Why do an observational study?
• Establishing causation is not always the goal (e.g., predicting whether or not an email is spam).
• Randomization may not be ethical (e.g., assigning subjects of a clinical trial of a cancer drug to treatment or placebo).
• It may be arguable scientifically that a confounder is “unlikely” (e.g., a 6 month smoking ban in Helena, MT coinciding with a 40% reduction in heart attacks).
• You might have an incidentally observed dataset (Walmart collects petabytes of data per day; should this data be discarded because it is observational?).

Inference to Populations: Random Sample vs. Self-Selection
• Inference to populations can be drawn from a RANDOM SAMPLE FROM THAT POPULATION.
• Inference to populations cannot be drawn if units are self-selected. In the creativity example, inference can only be drawn to the subjects in the sample that was taken.

RANDOM SAMPLE: experimental units selected via a “chance mechanism” from a well defined population.
Example: call randomly selected phone numbers for a survey.
• What is the population from which the sample is taken? If drawing from a physical phone book, is it the people who live in the city?
• Would this sampling method result in inferences to different populations if it were used in 1950? 1990? Present day?

SIMPLE RANDOM SAMPLE: every subset of size n is equally likely.
Example: I’ll assign everyone in this class a random integer 17, 200, -3, 472, … and survey the n people (units) with smallest numbers.

Statistical Inferences Permitted by Study Design
Which of the studies uses random sampling?
• Neither study uses random sampling.
• Creativity study: units are volunteers.
• Bank study: units are the entire staff.
• No inference about a larger population is possible.
• This does not mean the results are not interesting or compelling!

Practice with Scope: Q1
A particular study focused on high school freshman and seniors and their GPAs in a required economics class. The study consisted of enumerating every freshman and senior in the school and randomly selecting them from that sampling frame. Their scores in the economics class were then recorded, and a hypothesis test for the difference of means was conducted. The seniors were found to have a significantly greater mean score in the class than the freshman. What sort of conclusions can be made from this study? In other words, what is the scope of this study? In this class, scope typically constitutes both the causal inferences and population inferences.
Since the subjects cannot be randomly assigned to be freshman or seniors, this is an observational study, and thus the difference in mean scores is only associated with the freshman / senior status. We can’t tell if the class (freshman or senior) caused the difference or not. The sample was a random sample from the school; therefore, these findings can be generalized to all freshman and seniors in the school. In conclusion, it can be inferred that the mean economics score of the seniors in the school is greater than that of the freshman, although the cause of this difference cannot be determined from this study.

Practice with Scope: Q2
The Navy is very interested in the effects of sleep deprivation on cognitive ability. In order to test the effect, the Navy put out a radio advertisement asking for 18 to 35 year old nonsmokers to participate in the study. The volunteers were then placed in either the control group (no sleep deprivation) or the treatment group (36 hours of sleep deprivation) based on the flip of a fair coin (Heads = Control, Tails = Treatment). After the data was collected, the sleep deprived group was found to have a significantly lower mean math score than the group not deprived of sleep. What sort of conclusions can be made from this study? In other words, what is the scope of this study (causal inferences and population inferences)?
Since the subjects were randomly assigned to the control and treatment groups, this is a randomized experiment; thus, the difference in mean scores can be concluded to be caused by the sleep deprivation. Since the subjects were volunteers who responded to a radio advertisement, it is easy to see that every member of the population did not have the same chance of being selected, and thus the sample is NOT a random sample. Therefore these findings cannot be generalized to all U.S. nonsmokers between the age of 18 and 35. In conclusion, it can be inferred that sleep deprivation caused the decrease in cognitive ability (as measured by the timed math test) for these 57 individuals only.

Drawing Statistical Conclusions: Measuring Uncertainty in Randomized and Observational Studies

Creativity Study
For the sake of the example, suppose there are only 4 subjects.

Original grouping (the TEST STATISTIC is the difference in group averages):
Int (Grp 1): Bob 12, Sue 17 (Avg. 14.5)
Ext (Grp 2): Dan 5, Sal 15 (Avg. 10)
Diff: 14.5 – 10 = 4.5

To quantify “large,” we can randomly reallocate units to two groups and recompute the difference in sample means many times. Under the NULL HYPOTHESIS the treatments have the same effect, so each participant would have the same score regardless of grouping; under the ALTERNATE HYPOTHESIS they do not. Everyone has the same score with each grouping; only the group each person is artificially put in changes with each regrouping.

All other possible groupings:
Grp 1: Bob 12, Dan 5 (Avg. 8.5); Grp 2: Sue 17, Sal 15 (Avg. 16); Diff: 8.5 – 16 = -7.5
Grp 1: Bob 12, Sal 15 (Avg. 13.5); Grp 2: Dan 5, Sue 17 (Avg. 11); Diff: 13.5 – 11 = 2.5
Grp 1: Sal 15, Sue 17 (Avg. 16); Grp 2: Dan 5, Bob 12 (Avg. 8.5); Diff: 16 – 8.5 = 7.5
Grp 1: Dan 5, Sue 17 (Avg. 11); Grp 2: Bob 12, Sal 15 (Avg. 13.5); Diff: 11 – 13.5 = -2.5
Grp 1: Dan 5, Sal 15 (Avg. 10); Grp 2: Bob 12, Sue 17 (Avg. 14.5); Diff: 10 – 14.5 = -4.5

4 out of 6 groupings have test statistics as extreme or more extreme than the original grouping (“as extreme or more extreme” means the absolute value of the test statistic is at least 4.5). So the p-value is 4/6 = 0.667. This answers the question of how unusual our test statistic would be if the treatments had the same effect.
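As a quick cross-check of this toy example (my own addition, not part of the original slides), a few lines of R enumerate all six regroupings and recover the p-value of 4/6:

# Scores for the four subjects in the slide example
scores <- c(Bob = 12, Sue = 17, Dan = 5, Sal = 15)
observed <- mean(scores[c("Bob", "Sue")]) - mean(scores[c("Dan", "Sal")])  # 4.5

# Every way to place two of the four subjects in group 1
groupings <- combn(names(scores), 2)
diffs <- apply(groupings, 2, function(g1) {
  mean(scores[g1]) - mean(scores[setdiff(names(scores), g1)])
})
diffs                              # 4.5, -7.5, 2.5, -2.5, 7.5, -4.5
mean(abs(diffs) >= abs(observed))  # 4/6 = 0.667, the randomization p-value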


Creativity Study: Testing the Hypothesis (all 47 subjects)
The number of possible random regroupings is about 1.6 x 10^13: half a year with a computer that can perform a million calculations per second! Instead, 1000 different groupings (relabelings)* are used.
[Figure: histogram of the 1000 relabeled differences, with the observed differences of 4.14 and -4.14 marked; the tail area beyond them gives the P-VALUE.]
*Everyone has the same score with each grouping. What group each person is artificially put in changes with each regrouping. If the treatments had the same effect, then each participant would have the same score regardless of grouping.

Creativity Study (go to SAS code)
The TTEST Procedure, Variable: score
treatment 0: Mean 19.8833, 95% CL Mean (18.0087, 21.7580), Std Dev 4.4395, 95% CL Std Dev (3.4504, 6.2276)
treatment 1: Mean 15.7391, 95% CL Mean (13.4677, 18.0105), Std Dev 5.2526, 95% CL Std Dev (4.0623, 7.4343)
Diff (1-2), Pooled: Mean 4.1442, 95% CL Mean (1.2914, 6.9970), Std Dev 4.8541, 95% CL Std Dev (4.0261, 6.1138)
Diff (1-2), Satterthwaite: Mean 4.1442, 95% CL Mean (1.2776, 7.0108)

[Figure: histogram of 1000 different groupings (relabelings) with the observed differences 4.14 and -4.14 marked, alongside a table of eight example relabelings (COL139, COL170, COL279, COL360, COL537, COL551, COL604, COL664) with their pooled mean differences and confidence limits.]

There is strong evidence to suggest that the mean score of those who receive intrinsic motivation is not equal to that of those who receive extrinsic motivation (p-value = .008). The burden to reject the null hypothesis is lower under a one-sided test, so we can say that the evidence supports the claim that the intrinsic mean is higher than the extrinsic mean.
Since this was a randomized experiment, we can conclude that the intrinsic motivation caused this increase. In addition, since these were volunteers, this inference can only be assumed to apply to these 47 subjects, although the findings are very intriguing.


From Randomized to Observational Studies
• In the Creativity study, the Intrinsic/Extrinsic groups were randomly assigned to subjects.
• This motivated comparing the observed difference to re-randomized differences to test a hypothesis about the questionnaire having no effect.
• This is known as a RANDOMIZATION TEST.
• In observational studies, the groups are not randomly assigned.
• Though not technically the same test, we can still apply exactly the same re-randomization idea to observational data; however, now it is called a PERMUTATION TEST.

Appendix

Age Discrimination (Two Sided)
In the United States, it is illegal to discriminate against people based on various attributes. One such attribute is age. An active lawsuit, filed August 30, 2011, in the Los Angeles District Office is a case against the American Samoa Government for systematic age discrimination by preferentially firing older workers. Is there evidence for age discrimination in this study?
Data sampled at random from all American Samoa government workers:
Fired: 34 37 37 38 41 42 43 44 44 45 45 45 46 48 49 53 53 54 54 55 56
Not fired: 27 33 36 37 38 38 39 42 42 43 43 44 44 44 45 45 45 45 46 46 47 47 48 48 49 49 51 51 52 54

[Figure: histogram of 1000 different groupings (relabelings), with the observed differences of 1.9238 and -1.9238 marked.]

There is not sufficient evidence to suggest that the mean age of those who were fired is different from the mean age of those who were not fired (p-value = 0.204). The p-value is so high that even the null hypothesis of a one-sided test cannot be rejected. (There is insufficient evidence to claim that the mean age of fired employees is greater than that of not fired employees.)
Since this was a random sample of government employees in Samoa, we can generalize this inference to all government-employed people in Samoa.
Note: since we FTR (fail to reject) Ho, there is no need to discuss causation or association.

Part II

Inferences Using the t-distribution


Chapter 6

Problem 1: A one sample t test
Question 1
The world’s smallest mammal is the bumblebee bat, also known as the Kitti’s hog nosed bat. Such bats are roughly
the size of a large bumblebee! Listed below are weights (in grams) from a sample of these bats. Test the claim that
these bats come from the same population having a mean weight equal to 1.8 g. (Beware: This data is NOT the
same as in the lecture slides!) Sample: 1.7 1.6 1.5 2.0 2.3 1.6 1.6 1.8 1.5 1.7 1.2 1.4 1.6 1.6 1.6
1. Perform a complete analysis using SAS. Use the six step hypothesis test with a conclusion that includes a statistical conclusion, a confidence interval and a scope of inference (as best as can be done with the information
above … there are many correct answers given the vagueness of the description of the sampling mechanism.)
2. Inspect and run this R Code and compare the results (t statistic, p-value and confidence interval) to those you
found in SAS. To run the code, simply copy and paste the below code into R.
Code 6.1. One sample t test in R with manual data input
sample = c(1.7, 1.6, 1.5, 2.0, 2.3, 1.6, 1.6, 1.8, 1.5, 1.7, 1.2, 1.4, 1.6, 1.6, 1.6)
t.test(x=sample, mu = 1.8, conf.int = "TRUE", alternative = "two.sided")

Answer
6.1 Complete Analysis

Hypothesis definition

H0: µ = 1.8    (6.1.1)
H1: µ ≠ 1.8    (6.1.2)

Identification of a critical value and drawing a shaded t distribution
We have that n = 15 → df = n−1 = 14, α = 0.05. We input this into SAS and get our lovely shaded distribution and
critical value with the following code: This gives us a critical t value of ±2.14479, as seen in the following figures:
Figure 6.1.1. Critical t value


Code 6.2. Critical value and two sided shaded t distribution using SAS
data critval;
p = quantile("T",.975,14); /*two sided test*/;
proc print data=critval;
run;
data pdf;
do x = -4 to 4 by .001;
pdf = pdf("T", x, 14);
if x <= quantile("T",.025,14) then lower = pdf;
else lower = 0;
if x >= quantile("T",.975,14) then upper = pdf;
else upper = 0;
output;
end;
run;
title 'Shaded t distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x lower = lower upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;

Value of Test Statistic
The t statistic was calculated using the following SAS code
Code 6.3. One sample t test in SAS
proc ttest data=bats h0=1.8
sides=2 alpha=0.05;
run;

t = (x̄ − µ) / (s/√n) ≈ (1.65 − 1.8) / (0.25/√15) ≈ −2.35
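As a quick cross-check (my own addition, using the bat weights listed in the problem), the critical value, the t statistic, and the p-value can also be computed directly in R:

bats <- c(1.7, 1.6, 1.5, 2.0, 2.3, 1.6, 1.6, 1.8, 1.5, 1.7, 1.2, 1.4, 1.6, 1.6, 1.6)

qt(0.975, df = 14)                                # critical value, about 2.14479
t_stat <- (mean(bats) - 1.8) / (sd(bats) / sqrt(length(bats)))
t_stat                                            # about -2.35, matching the hand calculation
2 * pt(-abs(t_stat), df = 14)                     # two-sided p-value, about 0.034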

P value
This gives us a p-value of p = 0.0342

Assessment of the Hypothesis test
From here we can see that p = .0342 < α = .05, indicating that we REJECT the null hypothesis, which claims that µ = 1.8.

Conclusion and scope of inference
We cannot say that this sample of bats comes from a population with a mean weight of 1.8 grams (p-value = 0.0342 from a two-sided t test). Below is a graph produced with the code from step 4, which shows a 95% confidence interval on the distribution of the data (green) vs. the null hypothesis (gray bar).

The mean of 1.8 lies outside the reasonable range of the data from the sample, and, as our hypothesis test showed, vice versa is also true. We cannot say that our sample of bats has a mean weight of 1.8 g, and it is difficult to say that it came from a population with mean 1.8 g. However, we cannot make any conclusions about the population this sample came from, because it is not a random sample (and we clearly can't make any causal inferences). We only know, with 95% confidence, that our sample is not consistent with a mean of 1.8 grams, and that is about all we can say.

Some R code
Code 6.4. One sample t test in R
sample <- c(1.7, 1.6, 1.5, 2.0, 2.3, 1.6, 1.6, 1.8, 1.5, 1.7, 1.2, 1.4, 1.6, 1.6, 1.6)
t.test(x=sample, mu = 1.8, conf.int = "TRUE", alternative = "two.sided")


Chapter 7

Problem 2: Two sample one sided t test
Question
2. In the United States, it is illegal to discriminate against people based on various attributes. One example is age.
An active lawsuit, filed August 30, 2011, in the Los Angeles District Office is a case against the American Samoa
Government for systematic age discrimination by preferentially firing older workers. Though the data and details
are currently sealed, suppose that a random sample of the ages of fired and not fired people in the American
Samoa Government are listed below: Fired 34 37 37 38 41 42 43 44 44 45 45 45 46 48 49 53 53 54 54 55 56 Not
fired 27 33 36 37 38 38 39 42 42 43 43 44 44 44 45 45 45 45 46 46 47 47 48 48 49 49 51 51 52 54
a. Perform a permutation test to test the claim that there is age discrimination. Provide the Ho and Ha, the
p-value, and full statistical conclusion, including the scope (inference on population and causal inference). Note:
this was an example in Live Session 1. You may start from scratch or use the sample code and PowerPoints from
Live Session 1.
b. Now run a two sample t-test appropriate for this scientific problem. (Use SAS.) (Note: we may not have talked
much about a two-sided versus a one-sided test. If you would like to read the discussion on pg. 44 (Statistical
Sleuth), you can run a one-sided test if it seems appropriate. Otherwise, just run a two-sided test as in class. There
are also examples in the Statistics Bridge Course.) Be sure to include all six steps, a statistical conclusion, and scope
of inference.
c. Compare this p-value to the randomized p-value found in the previous sub-question.
d. The jury wants to see a range of plausible values for the difference in means between the fired and not fired
groups. Provide them with a confidence interval for the difference of means and an interpretation.
f. Inspect and run this R Code and compare the results (t statistic, p-value, and confidence interval) to those you
found in SAS. To run the code, simply copy and paste the code below into R.

Answers
7.1 Permutation test

First, a permutation test is run using n = 9999 permutations, with the code I wrote in homework one, inspired by [2]. The code used to run the permutation test is shown below. In this scenario, we have that:
H0: µf − µuf ≤ 0
H1: µf − µuf > 0
where the null hypothesis is that the average age of the fired individuals is no greater than that of the unfired individuals, and the alternative is that the average age of the individuals who were fired is higher. The results of the permutation test are as follows:

Code 7.1. A one sided permutation test in SAS
obsdiff = mean(G1) - mean(G2); /*G1 and G2 represent the two groups*/
print obsdiff;
call randseed(12345);
/* set random number seed */
alldata = G1 // G2;
/* stack data in a single vector */
N1 = nrow(G1);
N = N1 + nrow(G2);
NRepl = 9999;
/* number of permutations */
nulldist = j(NRepl,1);
/* allocate vector to hold results */
do k = 1 to NRepl;
x = sample(alldata, N, "WOR");
/* permute the data */
nulldist[k] = mean(x[1:N1]) - mean(x[(N1+1):N]);
/* difference of means */
end;
title "Histogram of Null Distribution";
refline = "refline " + char(obsdiff) + " / axis=x lineattrs=(color=red);";
call Histogram(nulldist) other=refline;
pval = (1 + sum(abs(nulldist) >= (obsdiff))) / (NRepl+1);
print pval;

In the above figure, the red line represents the observed difference between the means of the two samples, and the bars represent our null distribution. SAS tells us that the p-value is 0.2812, meaning 28.12 percent of the null distribution is at least as large as our observed difference. Therefore, at a 5% or even a 10% significance level, we cannot reject the null hypothesis. We cannot say whether or not there was age discrimination in the firing of workers with the given sample. With this procedure, we can make generalizations about the population, and generalize to all of the government-employed people in Samoa, as this was a random sample; however, we cannot make causal inferences, as there may be confounding variables in the system, and we did not run a randomized experiment. There is also no need to discuss causal questions, because we failed to reject the null hypothesis.
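For reference, here is a rough R version of the same one-sided permutation test (my own sketch, not part of the original assignment; it mirrors the logic of the SAS IML code). Note that it counts only the upper tail, whereas the SAS code above compares abs(nulldist) to the observed difference, so this strictly one-sided p-value may come out smaller than the reported 0.2812.

fired <- c(34, 37, 37, 38, 41, 42, 43, 44, 44, 45, 45, 45, 46, 48, 49, 53, 53, 54, 54, 55, 56)
not_fired <- c(27, 33, 36, 37, 38, 38, 39, 42, 42, 43, 43, 44, 44, 44, 45,
               45, 45, 45, 46, 46, 47, 47, 48, 48, 49, 49, 51, 51, 52, 54)

obs_diff <- mean(fired) - mean(not_fired)  # observed difference in mean age
all_ages <- c(fired, not_fired)
n1 <- length(fired)

set.seed(12345)
n_perm <- 9999
null_dist <- replicate(n_perm, {
  shuffled <- sample(all_ages)             # permute ages across the two groups
  mean(shuffled[1:n1]) - mean(shuffled[-(1:n1)])
})

# One-sided p-value: fraction of permuted differences at least as large as observed
(1 + sum(null_dist >= obs_diff)) / (n_perm + 1)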

7.2 Two sample T test, full analysis

This time we will conduct a t test on the two data sets to determine whether age discrimination occurred or not. Because we believe the older workers may have been fired, we are going to perform a one-sided t-test.


Hypothesis definition
First we construct our hypotheses:
H0: µf − µuf ≤ 0
H1: µf − µuf > 0

critval and distribution
Next we draw and shade our distribution. In a two sample t-test, we have that
df = nf + nnf − 2
where in our case df = 21 + 30 − 2 = 49 and α = 0.05.
Now we input this information into SAS to draw our distribution [1]:
Code 7.2. One sided shaded t distribution in SAS and Critval
data pdf;
do x = -4 to 4 by .01;
pdf = pdf("T", x, 49);
lower = 0;
if x >= quantile("T",0.95,49) then upper = pdf; /*one sided*/
else upper = 0;
output;
end;
run;
title 'Shaded t distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x
lower = lower
upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;
data critval;
p = quantile("T",.95,49); /*one sided test*/;
proc print data=critval;
run;

Giving us this lovely graph:


Next we find a number for the critical value, using the same code as problem 1:

This gives us a critical t value of 1.67655.
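As a quick R cross-check (my own addition), the same critical value comes from the t quantile function:

qt(0.95, df = 49)   # upper-tail critical value, about 1.6766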

Calculation of the T statistic
Next we calculate our two sample t statistic using SAS:
Code 7.3. Two sample t test using SAS
proc ttest data=samoa
alpha=.05 test=diff
sides=U;
class fired;
var age;
run;

Which tells us that our t statistic is 1.10

P value
With the code from the previous step, we also see the p value:

p = 0.1385

hypothesis assessment
p = 0.1385 > α = 0.05 for the one tailed hypothesis test, indicating that we CANNOT REJECT the null hypothesis

conclusion
The p value for the t test was about half of the p value for the permutation test; I believe this is because I ran a one-sided t test. It is interesting to note that if you do a two-sided t-test in SAS, you get roughly the same value for p as in the permutation test:


This suggests that a permutation test is a good approximation to the two-sided t-test.
We cannot reject the null hypothesis, meaning we cannot say that older workers were preferentially fired from the Samoan government. Note that we used a one-tailed hypothesis test in this scenario, as we wanted to determine whether the fired group was OLDER than the non-fired group. As a result of this test, we cannot say that the fired group was older than the unfired group, and since this sample was random, we can say the same thing about the entire Samoan government. However, we cannot make causal inferences, and there is no need to because we did not reject the null hypothesis.
We can provide several confidence intervals for the jury. I think the most telling is the one-sided confidence interval, which tells us which differences in the means are plausible. This was produced using the following SAS code:
proc ttest data=samoa
alpha=.05 test=diff
sides=U; /*an upper tailed test*/
class fired;
var age;
run;

which gives us a confidence interval of [−1.0107, ∞). This confidence interval represents the plausible values for the difference of means at a 95% confidence level. We can interpret it as follows: if the confidence interval contains the null-hypothesis value, then we cannot reject the null hypothesis; if it does not, we must reject it. As we can see in this beautifully drawn figure, the null hypothesis, µf − µnf ≤ 0, is contained within our CI.

This means we cannot reject the null hypothesis; we cannot say there was age discrimination. It is plausible that the mean difference for the entire population of Samoan government employees is less than or equal to zero, as zero is within the 95% confidence interval, which means we cannot, as objective jurors, claim there was age discrimination.

Incorrect calculations
The pooled sample standard deviation, sp, is defined as

s_p^2 = Σ_{i=1}^{k} (n_i − 1) s_i^2 / Σ_{i=1}^{k} (n_i − 1)

which for us is

s_p = √[ ((21 − 1)(6.5214)^2 + (30 − 1)(5.8835)^2) / (20 + 29) ] = 6.152

The equation for the standard error of the difference of means is given as

σ_{x̄1−x̄2} = √( s_1^2/n_1 + s_2^2/n_2 )

which gives us that

σ_{x̄1−x̄2} = √( 6.5214^2/21 + 5.8835^2/30 ) = 1.811
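The same quantities can be reproduced in R (my own sketch, plugging in the group sizes and standard deviations quoted above):

n1 <- 21; s1 <- 6.5214   # fired group
n2 <- 30; s2 <- 5.8835   # not fired group

# Pooled standard deviation
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
sp                       # about 6.15

# Standard error of the difference in means using the separate (unpooled) variances
sqrt(s1^2 / n1 + s2^2 / n2)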
7.3 R code

The following code (supplied in the homework) was put into R, returning the output shown below:
Code 7.4. Two sample t test in R
Fired = c(34, 37, 37, 38, 41, 42, 43, 44, 44, 45, 45, 45, 46, 48, 49, 53, 53, 54, 54, 55, 56)
Not_fired = c(27, 33, 36, 37, 38, 38, 39, 42, 42, 43, 43, 44, 44, 44, 45, 45, 45, 45, 46, 46, 47, 47, 48, 48, 49, 49, 51, 51, 52, 54)
t.test(x = Fired, y = Not_fired, conf.int = .95, var.equal = TRUE, alternative = "greater")

Two Sample t-test
data: Fired and Not_fired
t = 1.0991, df = 49, p-value = 0.1385
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval: -1.010728 Inf
sample estimates: mean of x = 45.85714, mean of y = 43.93333

The results are nearly identical; I cannot tell which one is better, but the differences between the SAS and R results are very small in all regards. The var.equal = TRUE argument is important because it makes R use the pooled (equal-variance) test.


Chapter 8

Problem 3: two sample two sided t test
Question
3. In the last homework, it was mentioned that a Business Stats professor here at SMU polled his class and asked students how much money (cash) they had in their pockets at that very moment. The idea was that we wanted
to see if there was evidence that those in charge of the vending machines should include the expensive bill / coin
acceptor or if it should just have the credit card reader. However, another professor from Seattle University was
asked to poll her class with the same question. Below are the results of our polls.
SMU 34, 1200, 23, 50, 60, 50, 0, 0, 30, 89, 0, 300, 400, 20, 10, 0 Seattle U 20, 10, 5, 0, 30, 50, 0, 100, 110, 0, 40,
10, 3, 0 a. Run a two sample t-test to test if the mean amount of pocket cash from students at SMU is different than
that of students from Seattle University. Write up a complete analysis: all 6 steps including a statistical conclusion
and scope of inference (similar to the one from the PowerPoint). (This should include identifying the Ho and Ha as
well as the p-value.) Also include the appropriate confidence interval. FUTURE DATA SCIENTIST’S CHOICE!: YOU
MAY USE SAS OR R TO DO THIS PROBLEM! b. Compare the p-value from this test with the one you found from the
permutation test from last week. Provide a short 2 to 3 sentence discussion on your thoughts as to why they are
the same or different.

Answer
8.1 Full Analysis

Hypothesis Definition
Hypothesis set up:
H0: µ1 − µ2 = 0
H1: µ1 − µ2 ≠ 0

Critical value and shaded distribution
Next we draw and shade our distribution. In a two sample t-test, we have that
df = n1 + n2 − 2
where in our case df = 16 + 14 − 2 = 28 and α = 0.05. In this case we are performing a two tailed test. Now we input this information into SAS to draw our distribution [1]:
data pdf;
do x = -4 to 4 by .001;
pdf = pdf("T", x, 28);
/*here it is important to set up a two sided test*/
if x <= quantile("T",.025,28) then lower = pdf;
else lower = 0;
if x >= quantile("T",.975,28) then upper = pdf;
else upper = 0;
output;
end;
run;
title 'Shaded t distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x lower = lower upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;

With this bit of code, we have produced our shaded two tailed PDF:

This critical value, where the bands start, is calculated using the following SAS code:
data critval;
p = quantile("T",.975,28); /*two sided test*/;
proc print data=critval;
run;

This gives us a critical t value of ±2.04841

T statistic
the t stat is calculated using the following code:
Code 8.1. Two sided two sample t test in SAS
proc ttest data=wallet
alpha=.05 test=diff
sides=2; /*a two sided test*/
class school;
var cash;
run;

which tells us that our t statistic is −1.37


P value
With the code from the previous step, we also see the p value, p = 0.1812:

Hypothesis Assessment
p = 0.1812 > α = 0.05 for the two tailed hypothesis test, indicating that we CANNOT REJECT the null hypothesis

Conclusion and Scope of inference
We cannot reject the null hypothesis, meaning we cannot say that the mean amount of cash in an SMU student’s wallet is any different than the mean amount of cash in a Seattle U student’s wallet. The following figure is a good reference for the results of this test:

The circled area shows the estimated difference between the mean amount of cash in a Seattle U student’s wallet and an SMU student’s wallet. We can see that the average student from the Seattle sample had about 112 dollars less in his wallet than the average SMU student. This may sound like a lot, but it is not statistically significant. The highlighted 95% confidence interval for the difference in means is (−281.2, 55.6817); because this interval contains 0, we cannot conclude that the mean amount of cash in a Seattle U student’s wallet differs from that of an SMU student. The plausible values for the difference range from the average Seattle student having about 281 dollars less than the average SMU student to having about 55 dollars more. Our p value of 0.1812 tells a similar story: under the null hypothesis, there is an 18% chance of seeing a difference in means at least this large, which at a 5 or 10 percent significance level is not statistically significant at all. As for scope of inference, we cannot make inferences about the greater population of either university, because these were not random samples. We also cannot make causal inferences (e.g., going to SMU makes you have money in your wallet!), as this is not a randomized experiment either. Note also the large outlier in the SMU sample (the student with 1200 dollars), which heavily influences the sample mean.
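As a cross-check (my own addition, using the cash amounts listed in the problem), the same pooled two-sided test can be run in R; it should closely reproduce the SAS t statistic, p-value, and confidence interval above, up to the sign convention for which group is subtracted from which.

SMU <- c(34, 1200, 23, 50, 60, 50, 0, 0, 30, 89, 0, 300, 400, 20, 10, 0)
Seattle <- c(20, 10, 5, 0, 30, 50, 0, 100, 110, 0, 40, 10, 3, 0)

# Pooled (equal-variance) two-sided two-sample t test, Seattle minus SMU
t.test(x = Seattle, y = SMU, var.equal = TRUE, alternative = "two.sided")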


Chapter 9

Problem 4: power
Question
4. A. Calculate the estimate of the pooled standard deviation from the Samoan discrimination problem. Use this
estimate to build a power curve. Assume we would like to be able to detect effect sizes between 0.5 and 2 and we
would like to calculate the sample size required to have a test that has a power of .8. Simply cut and paste your
power curve and SAS code. HINT: USE THE CODE FROM DR. McGEE’s lecture. Instead of using groupstddevs,
use stddev since we are using the pooled estimate. B. Now suppose we decided that we may be able to live with
slightly less power if it means savings in sample size. Provide the same plot as above but this time calculate curves
of sample size (y-axis) vs. effect size (.5 to 2) (x axis) for power = 0.8, 0.7, and 0.6. There should be three plots on
your final plot. Simply cut and paste your power curve and SAS code. HINT: USE THE CODE FROM DR. McGEE’s
lecture. Instead of using groupstddevs, use stddev since we are using the pooled estimate. The effect size here
refers to a difference in means, though there are many effect size metrics, such as Cohen’s D. C. Using similar code,
estimate the savings in sample size from a test aimed at detecting an effect size of 0.8 with a power of 80% versus
a power of 60%. Note: You will learn how to do this in R in a future HW!

Answers
9.1 Single power curve

The pooled standard deviation, calculated in Problem 2, is sp = 6.5215. The difference of the means of the two groups, meandiff in the code, is just set to the difference between the means of our two samples, calculated using the R-generated means in Problem 2, part f: µf − µuf = 1.924. The value of meandiff is not important here, because by plotting against effect size we are cycling through mean differences between 0.5 and 6, so the meandiff parameter only really matters if you want to know the sample size for a specific difference of means. When building a power curve it is not important at all, but you need it to get proc power to work. The SAS code used to build the power curve is shown below:
Code 9.1. Proc power single with pooled variance
proc power;
twosamplemeans
/*test=diff not diffsatt bc pooled variance*/
test=diff
stddev=6.5215
/*meandiff is a dummy variable in this case*/
meandiff=1.924
power=.8
ntotal = .;
plot x=effect min=.5 max=6;
run;


And the power curve:

9.2 Multiple power curves

The same notes as above apply here; this time we used the following SAS code to generate multiple power curves:
Code 9.2. Producing several curves with proc power
proc power;
twosamplemeans
/*test=diff not diffsatt bc pooled variance*/
test=diff
stddev=6.5215
/*meandiff is a dummy variable in this case*/
meandiff=1.924
power=.8 .7 .6
ntotal = .;
plot x=effect min=.5 max=6;
run;

And the curves:

9.3 Calculating change in N

It is important to remember that the “effect size” calculated in this SAS code is the exact same thing as the “mean
difference”. Therefore we can write our SAS code as follows:
proc power;
twosamplemeans
test=diff /*diff not diffsatt bc pooled variance*/
stddev=6.5215
meandiff= 0.8 /*this represents the effect size*/
power=.8 .6
ntotal = .;
run;

Which gives us our sample size savings:

As we see from the figure above, by raising the power from 0.6 to 0.8, we have to nearly double the sample size to meet the test parameters. By using a power of 0.6 instead of 0.8, we save 784 observations of total sample size.
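R’s built-in power.t.test gives a comparable calculation (my own sketch; it reports the sample size per group, so the total-N savings is roughly twice the difference between the two calls):

# Sample size per group to detect a mean difference of 0.8 with the pooled SD used above
n_80 <- power.t.test(delta = 0.8, sd = 6.5215, sig.level = 0.05, power = 0.8)$n
n_60 <- power.t.test(delta = 0.8, sd = 6.5215, sig.level = 0.05, power = 0.6)$n

# Approximate savings in total sample size from accepting 60% power instead of 80%
2 * (ceiling(n_80) - ceiling(n_60))   # should be close to the SAS figure of about 784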


Chapter 10

Unit 2 Lecture Slides


Inference Using t-Distributions
Measuring uncertainty in randomized and observational studies; distribution of the sample average; using the t-distribution for one sample inference; starting to explore the t-distribution for two sample problems.

Central Limit Theorem and the Distribution of the Sample Average
The sample mean x̄ is a point estimate for µ and is an unbiased estimator of the population mean. The more data you pick for each sample, the more normal (and tighter) the distribution of the sample mean is; note that the distribution of the original data is the distribution of a sample mean of size 1. If the original data are approximately normal, then the distribution of the sample mean will be approximately normal regardless of sample size.
(Sampling distribution applet: http://onlinestatbook.com/stat_sim/sampling_dist/)

THE CENTRAL LIMIT THEOREM!!!
[Slide figures: dice simulation tables and histograms. Individual rolls (n = 1) give a flat histogram over the faces 1 to 6; histograms of sample means of size n = 2, 5, and 10 ("Average of 2 Dice", "Average of 5 Dice", "Average of 10 Dice") become progressively tighter and more bell shaped around 3.5, illustrating the Central Limit Theorem.]
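A small R sketch (not from the slides) that reproduces the dice demonstration above by simulating 10,000 sample means for each sample size:

set.seed(1)
par(mfrow = c(2, 2))
for (n in c(1, 2, 5, 10)) {
  means <- replicate(10000, mean(sample(1:6, n, replace = TRUE)))
  hist(means, breaks = 30, main = paste("Average of", n, "dice"), xlab = "average")
}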

The t-Ratio
The sample mean x̄ is an unbiased estimate for µ, and s estimates the population standard deviation. Example: if we have data 79, 83, 84, 89, 90 mm for digitus tertius (the human middle finger), what is an estimate of the standard deviation? The ratio (x̄ − µ)/(s/√n) has a t distribution if Y is normally distributed.

Student t Distributions for n = 3 and n = 12
Student t distributions have the same general shape and symmetry as the standard normal distribution but reflect a greater variability (heavier tails), which is expected with small samples. (William Sealy Gosset, writing as "Student".)

Example: 1 Sample Confidence Interval
The following are ages of 7 randomly selected patrons at the Beach Comber in South Mission Beach at 7pm. We assume that the data come from a normal distribution and would like to build a 95% confidence interval for the actual mean age of patrons at the Comber.
25, 19, 37, 29, 40, 28, 31
Treating the standard deviation as known (n = 7, x̄ = 29.86, σ = 7.08, α = 0.05, α/2 = 0.025, z(α/2) = 1.96), the interval is x̄ − E < µ < x̄ + E with E = z(α/2) · σ/√n = (1.96)(7.08)/√7 = 5.24, giving 24.62 < µ < 35.10. IMPORTANT: these are the plausible values of the mean given the data! We are 95% confident that the mean age of Beach Comber patrons at 7pm is contained in any 95% confidence interval, such as (24.62 years, 35.10 years).
Using the sample standard deviation instead (n = 7, x̄ = 29.86, s = 7.08, α = 0.05, α/2 = 0.025, t(α/2, n−1) = 2.447), E = t(α/2, n−1) · s/√n = (2.447)(7.08)/√7 = 6.55, giving 23.31 < µ < 36.41. We are 95% confident that the mean age of Beach Comber patrons at 7pm is contained in any 95% confidence interval, such as (23.31 yrs., 36.41 yrs.).
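A quick R check of the two intervals above, using the listed ages (a sketch, not part of the slides):

ages <- c(25, 19, 37, 29, 40, 28, 31)
mean(ages); sd(ages)                                        # about 29.86 and 7.08
mean(ages) + c(-1, 1) * qnorm(0.975) * sd(ages) / sqrt(7)   # z interval, about (24.6, 35.1)
qt(0.975, df = 6)                                           # 2.447, the t critical value
t.test(ages)$conf.int                                       # t interval, about (23.3, 36.4)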

Comparison of z to t
With σ treated as known, E = z(α/2) · σ/√n = (1.96)(7.08)/√7 = 5.24, giving 24.62 < µ < 35.10; with s estimated from the data, E = t(α/2, n−1) · s/√n = (2.447)(7.08)/√7 = 6.55, giving 23.31 < µ < 36.41. The t interval (23.31, 36.41) is wider than the z interval (24.62, 35.10), reflecting the extra uncertainty from estimating the standard deviation. In either case, we are 95% confident that the mean age of Beach Comber patrons at 7pm is contained in the interval.

1 Sample Hypothesis Testing: The 6 Steps
1. Identify Ho and Ha.
2. Find the critical value(s) and draw and shade.
3. Calculate the test statistic. (The evidence!)
4. Calculate the p-value.
5. Make a decision: Reject Ho or FTR Ho.
6. Write a clear conclusion in the context of the problem, using mostly non-statistical terms but always reporting the p-value! Add a confidence interval if appropriate. End this conclusion with a statement about the scope.

Example: 1 Sample t-test (the 6 steps formalized)
The following are ages of 7 randomly chosen patrons seen leaving the Beach Comber in South Mission Beach at 7pm. We assume that the data come from a normal distribution and would like to test the claim that the mean age of the distribution of Comber patrons is different than 21.
25, 19, 37, 29, 40, 28, 31
Step 1: Identify the null (Ho) and alternative (Ha) hypothesis.
Step 2: Draw and shade and find the critical value; df = 7 − 1 = 6.
Step 3: Find the test statistic. (The t value for the data.)
Step 4: Find the p-value: the probability of observing, by random chance, something as extreme or more extreme than what was observed under the assumption that the null hypothesis is true (usually found with software such as SAS, Stat Trek, or a t-table). The two shaded tail areas sum to 0.0162, so p-value = 0.0162 < .05. A confidence interval can be reported alongside the test.
Step 5: Key! The sample mean we found is very unusual under the assumption that the true mean age is 21, so we reject that assumption. That is, we REJECT Ho.
Step 6: There is sufficient evidence to conclude that the true mean age of patrons at the Comber at 7pm is not equal to 21 (p-value = 0.0162 from a t-test). We could also say that there is sufficient evidence to conclude that the true mean is greater than 21 (consider the shaded area in the rightmost tail). This was not a random sample of all times, only at 7pm; thus, the result cannot be applied to the bar at all times. The results are nevertheless intriguing.

One-Sided Test + Two-Sided CI Demonstration
Suppose we would like to test the claim that the mean age of patrons is greater than 24. The slides compare a one sided test at alpha = 0.05 with two sided tests (and their confidence intervals) at alpha = 0.1 and at alpha = 0.05; the one sided test at alpha = 0.05 corresponds to the two sided interval at alpha = 0.1 (a 90% CI), not to the 95% interval.

Two Sample t-Test for the Difference of Means with Independent Samples
Perform a two sample t-test for the difference in the mean score between the Intrinsic and Extrinsic groups from the chapter problem. Provide a complete analysis, including a full conclusion, confidence interval, and scope of inference. Use an alpha = .01 level of significance.
Step 1: Identify the null (Ho) and alternative (Ha) hypothesis.
Step 2: Draw and shade and find the critical value; df = 24 + 23 − 2 = 45.
Step 3: Find the test statistic. (The t value for the data.)
Step 4: Find the p-value: the probability of observing, by random chance, something as extreme or more extreme than what was observed under the assumption that the null hypothesis is true (usually found with software). The two shaded tail areas sum to 0.0054, so p-value = 0.0054 < .01.
Step 5: REJECT Ho.
Step 6: There is sufficient evidence to suggest that those who receive the Intrinsic treatment have a different mean score than those who receive the Extrinsic treatment (p-value = .0054 from a t-test). We can also claim that the mean intrinsic score is greater than the extrinsic one. (The burden of rejecting the null hypothesis for a one-tailed test is less than for a two-tailed test, given the test is in the relevant direction.) A 99% confidence interval for this difference is (.3347, 7.95). Since this was a randomized experiment, we can conclude that the Intrinsic treatment caused this difference. However, since the study was of volunteers (sampling bias), this inference can only be generalized to the 47 participants.

Finding the P-value
Step 4 in more detail: the p-value (< .01) can be found with Stat Trek, a t-table, or software like SAS.

Compare with Randomization (Permutation) Test
[Slide figure: the null distribution from 1000 different groupings (relabelings) of the data, with the observed difference 4.14 and its mirror image −4.14 marked in the tails.]
A few of the relabeled groupings, analyzed with a pooled procedure in SAS, are shown below:

Obs  Variable  Class       Method  Variances  Mean     LowerCLMean  UpperCLMean  StdDev  LowerCLStdDev  UpperCLStdDev  UMPULowerCLStdDev  UMPUUpperCLStdDev
1    COL139    Diff (1-2)  Pooled  Equal       4.4678   1.6594       7.2762      4.7786  3.9635         6.0187         3.9360             5.9708
2    COL170    Diff (1-2)  Pooled  Equal      -4.3192  -7.1485      -1.4899      4.8141  3.9930         6.0634         3.9653             6.0152
3    COL279    Diff (1-2)  Pooled  Equal      -4.5576  -7.3530      -1.7623      4.7564  3.9451         5.9908         3.9178             5.9430
4    COL360    Diff (1-2)  Pooled  Equal      -4.8897  -7.6340      -2.1454      4.6695  3.8731         5.8814         3.8462             5.8345
5    COL537    Diff (1-2)  Pooled  Equal       4.3826   1.5621       7.2031      4.7991  3.9806         6.0446         3.9530             5.9964
6    COL551    Diff (1-2)  Pooled  Equal      -5.0514  -7.7692      -2.3337      4.6243  3.8356         5.8245         3.8090             5.7781
7    COL604    Diff (1-2)  Pooled  Equal      -4.7109  -7.4832      -1.9385      4.7172  3.9127         5.9415         3.8855             5.8942
8    COL664    Diff (1-2)  Pooled  Equal       4.6636   1.8840       7.4431      4.7295  3.9228         5.9569         3.8956             5.9095

There is strong evidence to suggest that the mean score of those who receive intrinsic motivation is not equal to that of those who receive the extrinsic motivation (p-value = .008). The burden to reject the null hypothesis is lower under a one-sided test, so we can say that the evidence supports the claim that the intrinsic mean is higher than the extrinsic mean. Since this was a randomized experiment, we can conclude that the intrinsic motivation caused this increase. In addition, since these were volunteers, this inference can only be assumed to apply to these 47 subjects, although the findings are very intriguing.

Let's Talk Power!!!
Explore power! Here is an applet that will show you what happens to the power/beta when you change the sample size, alpha, standard deviation, or effect size (a measure of the difference between the null mean and the actual (alternative) mean): http://shiny.stat.tamu.edu:3838/eykolo/power/
Effect size basically measures the difference between the population mean (106) and the null mean (100). (It's not exactly this, though.)

(Go to break out.) Consider the following options.
A. The probability of rejecting Ho when the null is true.
B. The probability of accepting Ho when the null is true.
C. The probability of rejecting Ho when the null is false.
D. The probability of FTR Ho when the null is true.
E. The probability of FTR Ho when the null is false.
Which is power? C. Which is alpha? A. Which is beta? E.

Pick all that are true. The power increases when:
A. The sample size decreases.
B. The sample size increases.
C. The standard deviation / standard error decreases.
D. The effect size increases.
E. The effect size decreases.

Appendix

Distribution of the Sample Average: Another Example for Practice
H0: µ = 1.8, H1: µ ≠ 1.8, α = 0.05, x̄ = 1.713, s = .2588. Critical values: t = ±2.145.
On the basis of this test, there is not enough evidence to reject the claim that the mean weight of bumblebee bats is equal to 1.8 g (p-value = .2155 from a t-test). A 95% confidence interval is (1.57 g, 1.8566 g). The problem was ambiguous on the randomness of the sample; thus, we will assume that it was not a random sample, which makes inference to all bats strictly speculative.
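A quick R check of the bat example, computed from the summary statistics reported on the slides (n = 15 is assumed from the later Unit 3 slide on this example; small rounding differences from the slide's t = −1.297 are expected):

xbar <- 1.713; s <- 0.2588; n <- 15; mu0 <- 1.8
tstat <- (xbar - mu0) / (s / sqrt(n))   # about -1.30
2 * pt(-abs(tstat), df = n - 1)         # two sided p-value, about 0.215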


Part III

A Closer look at Assumptions


Chapter 11

Problem 1: Two Sample T test with
assumptions
Question
1. In the United States, it is illegal to discriminate against people based on various attributes. One example is age.
An active lawsuit, filed August 30, 2011, in the Los Angeles District Office is a case against the American Samoa
Government for systematic age discrimination by preferentially firing older workers. Though the data and details
are currently sealed, suppose that a random sample of the ages of fired and not fired people in the American
Samoa Government are listed below: Fired 34 37 37 38 41 42 43 44 44 45 45 45 46 48 49 53 53 54 54 55 56 Not
fired 27 33 36 37 38 38 39 42 42 43 43 44 44 44 45 45 45 45 46 46 47 47 48 48 49 49 51 51 52 54 a. Check the
assumptions (with SAS) of the two-sample t-test with respect to this data. Address each assumption individually
as we did in the videos and live session and make sure and copy and paste the histograms, q-q plots or any other
graphic you use (boxplots, etc.) to defend your written explanation. Do you feel that the t-test is appropriate? b.
Check the assumptions with R and compare them with the plots from SAS. c. Now perform a complete analysis of
the data. You may use either the permutation test from HW 1 or the t-test from HW 2 (copy and paste) depending on
your answer to part a. In your analysis, be sure and cover all the steps of a complete analysis: 1. State the problem.
2. Address the assumptions of t-test (from part a). 3. Perform the t-test if it is appropriate and a permutation test
if it is not (judging from your analysis of the assumptions). 4. Provide a conclusion including the p-value and a
confidence interval. 5. Provide the scope of inference.

Answer
11.1

Complete Analysis

Assumption checking in SAS
The assumptions were tested using proc ttest, which outputs histograms, box plots, QQ-plots, and performs an
F-test on the variances. The code used to produce all information in this section is presented below:
Code 11.1. Checking the assumptions of a t test in SAS
proc ttest data=samoa
alpha=.05 test=diff
sides=U; /*an upper tailed test*/
class fired;
var age;
run;


Normality
The normality of the data is checked using a QQ plot, a boxplot, and a histogram. First we will examine the QQ
plot:
Figure 11.1.1. Q-Q Plot for Normality

In Figure 11.1.1, the y axis represents the sample data and the x axis the theoretical normal quantiles. The line represents what a normal data set should look like: a 1-1 relationship between the data variable and the theoretical normal quantile. The data follow the normal line pretty well, so on a visual inspection we can say both samples are approximately normal. We can double check this using Figure 11.1.2, a histogram and boxplot:
Figure 11.1.2. Histogram and Boxplot for Normality

It is a bit harder to assess normality using the histogram and boxplot, but SAS gives us useful kernel density lines which show the distribution of the data in the histogram (the red line is the data and the blue line is the normal reference). As we can see, the data loosely follow the normal distribution; they differ a bit, but are pretty close. The box plot tells the same story: in both cases the mean is very near the median (in a normal distribution the mean and median are the same), with slight left and right skewing, but overall we can assume the data are normal.
Equal Variances
In order to assess the equality of the variances visually, we can again use the histogram and boxplot, this time displayed in Figure 11.1.3 (for ease of grading):


Figure 11.1.3. Histogram and Boxplot for Variance Equality

As we can see from the bounds of the histogram, the range of each data set is more or less the same size, with the means roughly in the center. This hints that the two data sets have near equal variances. This is confirmed by the box plot: the distance from the mean to the far left whisker and to the far right whisker is more or less the same for both data sets, which again indicates the variances are equal. This is further confirmed by examining the F test for equal variances, the results of which are displayed below:
Figure 11.1.4. F Test for Equal Variances

The F test is valid here because the data are approximately normal and the sample sizes are reasonably large (n ≈ 30 per group). It returns a p value of 0.60, meaning that a variance ratio at least this extreme would occur about 60% of the time if the variances were truly equal. At a significance level of 5, 10, 15, or 20 percent, the F test fails to reject equal variances. Therefore, we can assume equal variances.
Independence
In this case we can assume independence; the two samples do not relate to each other. Any dependence that exists we will assume away for the sake of the problem.
Conclusion
In my opinion, we can use a t-test for this data set, because all of the assumptions are reasonably met.

Assumption Checking in R
Normality test
To test for normality, we again use a Q-Q plot and a histogram. The following code was used to produce the Q-Q plots, which are shown below:


Code 11.2. t test Assumption checking in R, Q-Q plot
# producing adjacent Q-Q plots
par(mfrow = c(1, 2))
qqnorm(Fired, main = "Normal Q-Q Plot for Fired data",
       xlab = "Normal Quantiles",
       ylab = "Fired Quantiles")
qqnorm(Not_fired, main = "Normal Q-Q Plot for Not Fired data",
       xlab = "Normal Quantiles",
       ylab = "Not Fired Quantiles")

Figure 11.1.5. Q-Q plots for Normality in R

From the linearity of the data points in this figure, we can see that the data follow a more or less normal distribution. The Q-Q plot produced in R is almost exactly the same as the one produced in SAS; however, it does not include a reference line representing perfect normality, and the size of the plotting region changes with the window size (as does the aspect ratio), which is a bit of a pain. The following code is used to produce histograms, further examining normality, and produces the figure below:
Code 11.3. t test Assumption checking in R, Histogram
# producing the adjacent histograms
par(mfrow = c(1, 2))
hist(Fired)
hist(Not_fired)


Figure 11.1.6. Histogram for Normality in R

As can be seen in the figure, the distributions of these two data sets are again more or less normal, with what appear to be the mean and median lying in the center. There is a bit of a bump in the fired data set, but it is still loosely normal in appearance. The graphs look essentially the same as in SAS, other than formatting differences (the axis values are a bit easier to read in R). In this case, we can ASSUME NORMALITY.
Equality of Variances
Looking at the histograms in Figure 11.1.6, we can see that the fired data have a mean of about 45 years, spanning from 30 to 60, and the not fired data have a mean of about 40 years, spanning from 25 to 55. The spread of the two samples is more or less the same in this case, therefore we can ASSUME EQUAL VARIANCES.
Independence
We can again assume independence.
Conclusion:
The t-test is appropriate

Complete Analysis:
Problem statement:
We would like to test the claim that the mean age of the individuals who were fired is greater than the mean age
of the individuals who were not fired.
Assumptions:
We can assume normality, independence, and equal variances, and therefore we can use the Student t test, as shown in parts a and b above.
t-test
Statement of the Hypotheses:
H0 :µf − µuf ≤ 0
H1 :µf − µuf > 0


Shaded Distribution and Critical Values:

In a two sample t-test, we have that:
df = nf + nnf − 2

where in our case df = 21 + 30 − 2 = 49 and α = 0.05. Now we input this information into SAS to draw our distribution[1]:
data pdf;
do x = -4 to 4 by .01;
pdf = pdf("T", x, 49);
lower = 0;
if x >= quantile("T",0.9,49) then upper = pdf;/*one sided*/
else upper = 0;
output;
end;
run;
title 'Shaded t distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x
lower = lower
upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;

Giving us this lovely graph:

Next we find a number for the critical value, using the same code as problem 1:
data critval;
p = quantile("T",.95,49); /*one sided test*/;
proc print data=critval;
run;

This gives us a critical t value of 1.67655.
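The same critical value can be cross-checked quickly in R:

qt(0.95, df = 49)   # about 1.677, matching the SAS critval above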
Calculation of t statistic:

Next we calculate our two sample t statistic using SAS:


proc ttest data=samoa
alpha=.05 test=diff
sides=U;
class fired;
var age;
run;

Which tells us that our t statistic is 1.10

Calculation of P-value
With the code from the previous step, we also see the p value: p = 0.1385.
Discussion of the Null Hypothesis
p = 0.1385 > α = 0.05 for the one tailed hypothesis test, indicating that we FAIL TO REJECT the null hypothesis.
Conclusion:
We fail to reject the null hypothesis, meaning we cannot say that older workers were preferentially fired from the Samoan government. Note that we used a one tailed hypothesis test in this scenario, as we wanted to determine whether the fired group was OLDER than the non-fired group. With a one-sided p-value of 0.1385, there is roughly a 14% chance of seeing a difference in mean ages at least this large if the null hypothesis were true. At a significance level of .05 (5%), this data fails to reject the null hypothesis. Using the code that calculated the t statistic, we produce the following one sided confidence interval:

The confidence interval is [−1.0107, ∞), a one sided 95% confidence interval for the difference of means. We can interpret it as follows: if the confidence interval contains the null value (a difference of 0 or less), then we cannot reject the null hypothesis; if it does not, we must reject it. As we can see in this beautifully drawn figure, the null hypothesis, µf − µnf ≤ 0, is contained within our CI:


This means we cannot reject the null hypothesis: we cannot say there was age discrimination. It is plausible that the mean difference for the entire population of Samoan government employees is less than or equal to zero, as such values lie within the 95% confidence interval, which means we cannot, as objective jurors, claim there was age discrimination.
Scope of Inference:
Since this sample was random, we can make generalizations about the Samoan Government as a whole; however, we cannot make causal inferences, as this was not a randomized experiment.
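As an additional check (not required by the assignment), here is a minimal R sketch of the same upper tailed pooled t test, typing in the ages from the problem statement:

fired <- c(34,37,37,38,41,42,43,44,44,45,45,45,46,48,49,53,53,54,54,55,56)
not_fired <- c(27,33,36,37,38,38,39,42,42,43,43,44,44,44,45,45,45,45,46,46,
               47,47,48,48,49,49,51,51,52,54)
# pooled, upper tailed test of whether the fired group is older on average
t.test(fired, not_fired, var.equal = TRUE, alternative = "greater")
# t should be about 1.10 with a one sided p-value of about 0.14, matching the SAS output above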


Chapter 12

Outliers and Logarithmic Transformations



The permutation test was performed using the code below. We then perform the same procedure on the assumptions without the outlier, as well as some other comparisons; unless otherwise noted, the following code was used to produce the results and to remove outliers:


Code 12.1. Automatically input permutation test in SAS
/*Permutation test*/
data Wallet;
INFILE 'file location';
INPUT school $ cash;
run;
proc iml;
use Wallet var {school cash};
/*making two groups in IML*/
read all var {cash} where(school='SMU') into g1;
read all var {cash} where(school='SEU') into g2;
obsdiff = mean(g1) - mean(g2);
print obsdiff;
call randseed(12345);
/* set random number seed */
alldata = g1 // g2;
/* stack data in a single vector */
N1 = nrow(g1);
N = N1 + nrow(g2);
NRepl = 9999;
/* number of permutations */
nulldist = j(NRepl,1);
/* allocate vector to hold results */
do k = 1 to NRepl;
x = sample(alldata, N, "WOR");
/* permute the data */
nulldist[k] = mean(x[1:N1]) - mean(x[(N1+1):N]); /* difference of means */
end;
title "Histogram of Null Distribution";
refline = "refline " + char(obsdiff) + " / axis=x lineattrs=(color=red);";
call Histogram(nulldist) other=refline;
pval = (1 + sum(abs(nulldist) >= abs(obsdiff))) / (NRepl+1);
/*this means two sided test*/
print pval;
run;
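For comparison, here is a minimal R sketch of the same two sided permutation test. The column names school and cash mirror the SAS data step above; the file path is a placeholder exactly as in the SAS code, and read.table is an assumption about the file format:

wallet <- read.table("file location", header = FALSE,
                     col.names = c("school", "cash"))   # placeholder path/format
obsdiff <- mean(wallet$cash[wallet$school == "SMU"]) -
  mean(wallet$cash[wallet$school == "SEU"])
set.seed(12345)
nrepl <- 9999
n1 <- sum(wallet$school == "SMU")
nulldist <- replicate(nrepl, {
  shuffled <- sample(wallet$cash)                       # permute the cash values
  mean(shuffled[1:n1]) - mean(shuffled[-(1:n1)])
})
hist(nulldist); abline(v = obsdiff, col = "red")
(1 + sum(abs(nulldist) >= abs(obsdiff))) / (nrepl + 1)  # two sided p-value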


Code 12.2. Outlier removal in SAS
data Wallet;
INFILE 'file location';
INPUT school $ cash;
run;
data CleanCash;
set Wallet;
/*we are going to remove all the really high values*/
if cash >150 then delete;
run;
proc ttest data=CleanCash
alpha=.05 test=diff
sides=2; /*a 2 tailed test*/
class school;
var cash;
run;


Chapter 13

Log Transformed data
13.1

Full Analysis

Problem Statement:
We would like to test the claim that the distribution of incomes for those who have 16 years of education is shifted above (greater than) that of those who have 12 years of education.

Assumptions
We first produce the plots for our assumption analysis using the following bit of code:
proc import
/*to use proc import first we specify the file*/
datafile='genericfilepath/genericname.csv'
/*then we specify the name of the output dataset*/
out=edudata /*then we specify the data type*/
dbms=CSV;
run;
proc sort data=edudata;
by descending educ;
run;
proc ttest data=edudata
order=DATA /*this changes the order of the groups to the one you set*/
sides=U; /*an Upper tailed test*/
class Educ;
var Income2005;
run;

Producing the following figures:
Figure 13.1.1. Q-Q plot of sample


Figure 13.1.2. Histogram and Boxplot of the sample

Normality assumption:
Looking at the Q-Q plot(Figure 3.1), it is clear to see that the data is not normal at all. To investigate further, we will
look at the histograms and box plots in Figure 3.2. These paint a more complete picture, we see that the data is
skewed to the right, and that the higher values are much greater than the lower values (hundreds of thousands of
times). To combat this, lets perform a natural log transformation with this bit of code and see whatthe data looks
like:
Code 13.1. log transform in SAS
data edudata2;
set edudata;
lincome=log(Income2005);
run;
proc ttest data=edudata2
order=DATA sides=U; /*an Upper tailed test*/
class Educ;
var lincome;
run;

Producing the following figures:
Figure 13.1.3. Q-Q plot of logs


Figure 13.1.4. Histogram and Boxplot of Logs

With this transformation, we first look at the Q-Q plot (Figure 13.1.3) and see that the data are now mostly normal. The histograms (Figure 13.1.4) confirm this, both in their shape and in the shape of the kernel density plots; the nearness of the median to the mean is also a telltale sign of normality. Therefore, we can assume the log-transformed data are normal.
Equality of Variances
Since we cannot assume normality for the untransformed data, it makes little sense to analyze the equality of variances of that data set, so we look at the log transformed data. In Figure 13.1.4 the spread of the two data sets is pretty similar: the histograms are of similar width, with the 12 year data set a bit narrower than the 16 year set. The boxplot confirms this, as the distance from the means to the ends of the whiskers, as well as the IQRs, is roughly the same for both groups; the group with the larger mean also has a slightly larger variance. Therefore, we can assume the log transformed data have approximately equal variances.
Independence
We can assume the data is independent in this scenario.

3.3 Hypothesis testing
We will be using a one tailed pooled t test on the log transformed data in this scenario, since the transformed data satisfy the t test assumptions.

Statement of Hypotheses:
Note that since we are dealing with a pooled t-test on a log transformation, the results are naturally stated in terms of medians rather than means; the medians tell us whether the income distribution of the people with 16 years of education is shifted above that of those with 12 years of education.
H0: Median16 = Median12
H1: Median16 > Median12
which, under the shift interpretation, is equivalent to
H0: distribution16 = distribution12
H1: distribution16 > distribution12

Critical Value
In this scenario, α = 0.1 and df = 1424, and from that we can shade a one sided distribution and find a critical value,
using the code below:


data pdf;
do x = -4 to 4 by .01;
pdf = pdf("T", x, 1424);
lower = 0;
if x >= quantile("T",0.9,1424) then upper = pdf;/*one sided*/
else upper = 0;
output;
end; run;
title 'Shaded t distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x
lower = lower
upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;
data critval;
p = quantile("T",.9,1424); /*one sided test*/;
proc print data=critval; run;

This produces the shaded distribution:
Figure 13.1.5. Shaded t distribution

and a critical value of t = 1.28215

Calculation of the t statistic:
Now we calculate our t statistic using the proc ttest call from Code 13.1, which tells us that t = 10.98, an astoundingly large value.


Calculation of the p-value:
p < 0.0001, see the figure above!

3.3.5 Discussion of the Null hypothesis
We REJECT the null hypothesis, p ≈ 0 < 0.1 = α

Conclusion
We reject the null hypothesis, which states that the two distributions are equal. We have convincing evidence that the income distribution of the people with 16 years of education is greater than that of those with 12. With a one-sided p value of approximately 0, the distributions are very different: the median income of the people with 16 years of education is evidently greater than the median income of people with 12 years of education. The figure below shows the difference between the natural logarithms of the two medians:

This tells us that the median income of people with 16 years of education is e^0.5699 = 1.77 times greater than that of those with 12 years of education. A 90% confidence interval for this multiplicative effect is 1.62 to 1.93 times.

We cannot make causal inferences in this scenario, as there was no random experimentation, and we cannot make population inferences either, as there was no random sampling.
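For reference, here is a minimal R sketch of the same analysis; the file path and the column names Educ and Income2005 simply mirror the SAS proc import above:

edu <- read.csv("genericfilepath/genericname.csv")
edu$Educ <- factor(edu$Educ, levels = c(16, 12))   # put the 16 year group first, like order=DATA
# upper tailed pooled test on the log scale
t.test(log(Income2005) ~ Educ, data = edu, var.equal = TRUE, alternative = "greater")
# two sided 90% interval, back-transformed to a multiplicative effect on the median
fit <- t.test(log(Income2005) ~ Educ, data = edu, var.equal = TRUE, conf.level = 0.90)
exp(fit$estimate[1] - fit$estimate[2])   # should be about 1.77
exp(fit$conf.int)                        # should be about (1.62, 1.93)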


Chapter 14

Unit 3 Lecture slides


Confidence Intervals and Hypothesis Tests (Chapter 3: A Closer Look at Assumptions!)
For the corresponding alpha, a (1-alpha)% CI will contain mu_0 when the test of Ho: mu = mu_0 fails to reject Ho, and will not contain mu_0 when the test rejects Ho.
The Take Away: Two-sided 100(1-α)% confidence intervals are equivalent to two-tailed hypothesis tests that have an α level of significance. "Equivalent" here means that if we test any specific value in the interval, the test will FTR Ho, and if we test any specific value outside the interval, the test will Reject Ho. Example: a 95% confidence interval for the mean is equivalent to an α = .05 hypothesis test; a 99% confidence interval for the mean is equivalent to an α = .01 level hypothesis test. So we can evaluate hypothesis tests through the evaluation of confidence intervals!

Assumptions of One Sample t-Tests
1. Samples are drawn from a normally distributed population.
2. The observations in the sample are independent of one another.

Robustness of the One Sample t-Test / CI
When the original (population) distribution is not normal, the one sample t-test is still valid with a large enough sample size (Central Limit Theorem). That is, the one sample t-test is robust to the normality assumption when the sample size is large enough.

Assume the population distribution is Exponential with λ = 1.
[Slide figures: 1000 confidence intervals for the mean of an Exponential(1) distribution, for n = 10 (note the right skew) and for n = 100 (note the greater symmetry and smaller standard deviation).]

Given Data, How Do We Check the Normality Assumption? Visually!
[Slide figure: histogram and q-q plot for a sample of n = 100.]

Normal q-q Plot
Construction of a normal q-q plot for the data 41.2, 76.6, 109.3, 134.5, 148.6 (x̄ = 102.04, s = 43.65, n = 5):

data    rank   middle = (rank + previous rank)/(2n)   standard normal value based on middle   hypothetical data if perfectly normal   z-score of data = (data - x̄)/s
41.2    1      0.1                                    -1.28                                   46.09                                   -1.39
76.6    2      0.3                                    -0.52                                   79.15                                   -0.58
109.3   3      0.5                                     0.00                                  102.04                                    0.17
134.5   4      0.7                                     0.52                                  124.93                                    0.74
148.6   5      0.9                                     1.28                                  157.99                                    1.07

Q-Q plots are constructed differently depending on the software or textbook, but usually include some combination of the above columns. If the graph plots quantities on the same scale against each other (the data vs. the hypothetical normal data, or the z-scores vs. the standard normal values), then normal data should fall close to the line y = x. If it plots a data-scale column against a standard-normal-scale column, normal data should still fall along a straight line, but not necessarily one with slope 1. Different software will calculate this line differently.
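A small R sketch (not from the slides) that reproduces the q-q construction in the table above, using the fact that (rank + previous rank)/(2n) equals (rank − 0.5)/n for consecutive integer ranks:

x <- c(41.2, 76.6, 109.3, 134.5, 148.6)
n <- length(x)
middle <- (rank(x) - 0.5) / n                   # 0.1, 0.3, 0.5, 0.7, 0.9
theoretical <- qnorm(middle)                    # -1.28, -0.52, 0.00, 0.52, 1.28
hypothetical <- mean(x) + sd(x) * theoretical   # 46.09, 79.15, 102.04, 124.93, 157.99
zscore <- (x - mean(x)) / sd(x)                 # -1.39, -0.58, 0.17, 0.74, 1.07
cbind(x, middle, theoretical, hypothetical, zscore)
qqnorm(x); qqline(x)                            # R's built-in version of the same plot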

Given Data, How Do We Check the Normality Assumption? Visually!
[Slide figures: histograms and q-q plots for samples of size n = 100 and n = 15.]
n = 100: Not normal! The data are skewed to the right and do not fall along a straight line in the q-q plot.
n = 15: The data come from a normal distribution, but it is hard to tell given the small sample size.
Beware of small sample sizes!
n = 15: It looks like the data might not be normal (skew, curvature of the q-q plot), but it is hard to tell with this small sample size.
n = 15: The histogram shows an almost bimodal distribution (definitely not normal), but again it is hard to tell with small sample sizes; the q-q plot does not look too far away from normality.

A Way to Decide:
Little to no evidence against normality, small sample size: No problem if you feel normality is a safe assumption ... run the t-test. (You may want to be "conservative" here and run a test with fewer assumptions.)
Little to no evidence against normality, large sample size: No problem! Run the t-test.
Significant evidence against normality, small sample size: Assumptions are not met and the test is not robust here ... try a transformation and, if appropriate, run a t-test. If not appropriate, do NOT run the t-test and proceed to a test with fewer / different assumptions.
Significant evidence against normality, large sample size: No problem ... you have the Central Limit Theorem. Run the t-test.

A Complete Analysis:
Statement of the problem
Address the assumptions
Perform the appropriate test (5 steps)
Step 6: Provide a conclusion that a non-statistician can understand, including a p-value and confidence interval.
Scope of inference

Example: Beach Comber
The following are ages of 7 randomly chosen patrons seen leaving the Beach Comber in South Mission Beach at 7pm! We assume that the data come from a normal distribution and would like to test the claim that the mean age of the distribution of Comber patrons is different than 21.
25, 19, 37, 29, 40, 28, 31
PROBLEM STATEMENT: Test the claim that the mean age of Beach Comber patrons at 7pm is different from 21.
ASSUMPTIONS:
Normal Population Distribution: Judging from the histogram and q-q plots, there is little to no evidence that the population distribution of patron ages at the Comber at 7pm is not normal. We will assume that this distribution is normal and proceed.
Independence: These subjects were randomly selected from the population; thus, we will assume that the observations are independent.

Revised Write Up!
We would like to test the claim that the population mean is different from 21. To do this, we take a sample of size n = 7 and find that x̄ = 29.86 years and s = 7.09 years.
Step 1: Identify the null (Ho) and alternative (Ha) hypothesis: Ho: µ = 21, Ha: µ ≠ 21.
Step 2: Draw and shade and find the critical value.
Step 3: Find the test statistic: t = (x̄ − µ)/(s/√n) = (29.86 − 21)/(7.09/√7) = 3.31.
Step 4: Find the p-value: p-value = .0162 < .05.
Step 5: REJECT Ho.
Step 6: There is sufficient evidence to conclude that the true mean age of patrons at the Comber at 7pm is different from 21 (p-value = .0162 from a t-test). A 95% confidence interval for the mean age is (23.3, 36.4) years. Scope: Since this was a random sample, we can generalize these findings to the entire population of Comber patrons at 7pm. Note that we have evidence to support the claim that the mean age is greater than 21 as well.

Example: Bats
PROBLEM STATEMENT: Test the claim that the mean weight of the bumblebee bat is different from 1.8 g.
ASSUMPTIONS:
Normal Population Distribution: Judging from the histogram and q-q plots, there is some visual evidence of a departure from normality. With a sample size of 15 and no extreme outliers, we will assume the distribution of sample means is decently approximated by a normal distribution via the CLT and proceed with caution.
Independence: Not much is known about the sampling scheme used to obtain this sample. We will assume the observations are independent.
H0: µ = 1.8, H1: µ ≠ 1.8, α = 0.05, x̄ = 1.713, s = .2588. Critical values: t = ±2.145. Test statistic: t = −1.297. P-value: .2155 > .05, so we fail to reject H0.
On the basis of this test, there is not enough evidence to reject the claim that the mean weight of bumblebee bats is equal to 1.8 g (p-value = .2155 from a t-test). A 95% confidence interval is (1.57, 1.8566) grams. The problem was ambiguous on the randomness of the sample; thus, we will assume that it was not a random sample, which makes inference to all bats strictly speculative.

Assumptions of One and Two Sample t-Tests
1. Samples are drawn from a normally distributed population.
2. If it is a two sample test, both populations are assumed to have the same standard deviation (same shape).
3. The observations in the sample are independent of one another.

What happens if the normality assumption is broken? Many times, NO PROBLEM!!! By the Central Limit Theorem, the sampling distribution of x̄ is approximately normal and centered at µ even when the original data x are not normal.

In a two sample test, both populations are assumed to have the same standard deviation (same shape): we assume σ1 = σ2, and we want inference on µ1 − µ2.

Evidence of Inequality of Variance: Visual check and the F-Test for Equal Variance
Ho: the population variances are equal. Ha: the population variances are not equal.
First example: little visual evidence against equal standard deviations (variances), and there is not sufficient evidence to conclude the variances are different (p-value = .4289 from an F-test).

Second example: strong visual evidence against equal standard deviations (variances), even though the F-test gives p-value = .1043. The F-test has a strong assumption that the two populations whose variances it compares must be normal, and it is not robust to this assumption. Since the second distribution shows strong evidence of right skew, the F-test for equal variance is not appropriate here; for this example, the visual evidence is so strong that we would not need to consult a hypothesis test to assess the assumption of equal variances.

What happens if the assumption of equal variances (standard deviations) is broken? In some circumstances this could be serious; in others, no problem. Later in the semester we will study a test of spread/dispersion that does not have this assumption and can be used in a wider range of statistical environments.

The Take Away
What you will find in practice will most likely not fit exactly into the scenarios identified here. There will be some judgment involved ... this is the "art" of statistics. Here are some general rules of thumb that we will assume this semester:
1. If sample sizes are the same and sufficiently large, the t tools (tests and confidence intervals) are valid, since they are robust to the violation of normality.
2. If the two populations have the same standard deviation, then the t tests are valid, given sufficient sample sizes.
3. If the standard deviations are different and the sample sizes are different, then the t tools are not valid and another procedure should be used (Ch. 4).

Full Example: Creativity Study!
A Complete Analysis: statement of the problem, address the assumptions, perform the appropriate test (5 steps), Step 6: provide a conclusion that a non-statistician can understand (include a p-value and confidence interval), and the scope of inference.
We would like to test the claim that the mean score of the Intrinsic group is different than that of the Extrinsic group. To do this we take a sample of size nI = 24 and nE = 23 and find that x̄I = 19.88 points, x̄E = 15.74 points, sI = 4.44 points, and sE = 5.25 points.
Step 1: Identify the null (Ho) and alternative (Ha) hypothesis: Ho: µI = µE, Ha: µI ≠ µE, which is equivalent to Ho: µI − µE = 0, Ha: µI − µE ≠ 0.
Check Assumptions:
1. Normally Distributed Populations: The q-q plots for both populations look sufficiently normal. We look at the histograms as well, and there is not sufficient evidence here to suggest that the populations are not normal. Keeping in mind the relatively small sample size from each population, we do not observe any extreme outliers and we observe a pretty strong bell shape, which lends evidence to support normality of the populations. Visual inspection of the histograms and q-q plots of each population is consistent with normality, so we assume normality and move on to the second assumption.

2. Equal Standard Deviations: A visual check was done by looking at the histograms, which reveal similar shapes and support the equal variances assumption. Since we are able to assume normal population distributions, we can use the F-test to provide secondary evidence if the visual is inconclusive. Since the p-value is greater than our significance level of alpha = 0.05, we fail to reject the null hypothesis of equality (p-value = 0.1043) and conclude that there is not enough evidence to suggest the variances are different.
3. Independent Observations: The sample consisted of volunteers, and thus subjects may not be independent of one another. However, we will assume independence and proceed with caution.
Run the Test (first 5 steps):
Step 1: Ho: µI − µE = 0, Ha: µI − µE ≠ 0.
Step 2: Draw and shade and find the critical value.
Step 3: Find the test statistic: t = (x̄I − x̄E) / (sp · √(1/nI + 1/nE)) = 2.93.
Step 4: Find the p-value: p-value = 0.0054 < .01.
Step 5: Key! The sample mean difference we found is very unusual under the assumption that the group means are equal, so we reject this assumption. That is, we REJECT Ho.


Let's Fill in the P-value (and Add a CI)!
Step 4: Find the p-value: p-value = .0054. Step 5: REJECT Ho.
Step 6: Conclusion: There is sufficient evidence to suggest that those who receive the Intrinsic treatment have a higher mean score than those who receive the Extrinsic treatment (p-value = .0054 from a two sided t-test). A 99% confidence interval for this difference is (1.29, 7.00). SCOPE: Since this was a randomized experiment, we can conclude that the Intrinsic treatment caused this difference. However, since the study was of volunteers, this inference can only be generalized to the 47 participants.
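A short R sketch (not from the slides) that reproduces the pooled t statistic and p-value from the summary statistics given above:

nI <- 24; nE <- 23
xI <- 19.88; xE <- 15.74
sI <- 4.44;  sE <- 5.25
sp <- sqrt(((nI - 1) * sI^2 + (nE - 1) * sE^2) / (nI + nE - 2))   # pooled SD
tstat <- (xI - xE) / (sp * sqrt(1 / nI + 1 / nE))                 # about 2.93
2 * pt(-abs(tstat), df = nI + nE - 2)                             # about 0.0054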

Let's Try Some!
For each of these data sets, write up the assumption statement with respect to checking the assumptions for a one or two sample t-test. You may assume the data to be independent. (Happiness Data Set and Mice Experiment Data Set; all data sets can be found in one file in this week's materials. You will need to add the proc ttest statement for each, but you will not need the data for this exercise.)

Happiness Study
5 randomly selected people were asked to rate their happiness on a scale from 1 to 100 on a cloudy day, and 8 randomly selected people were asked the same question on a sunny day. QOI: Is the mean happiness of individuals different on a cloudy day than on a sunny day? If possible, can we test if cloudy weather causes a change in happiness? Address each assumption of the two sample t-test and then decide if the two-sample t-test is appropriate to answer this QOI with this data.
Normality of Distributions: Judging from the histograms and q-q plots, there is evidence of outliers in both the Cloudy and Sunny sets. The most pronounced outlier seems to be in the Sunny data set; thus, there is significant visual evidence against these data being normally distributed. In addition, we are not satisfied that the t-test will be robust to this assumption since the sample sizes are so small.
Equal Standard Deviations: Judging from the histograms, q-q plots, and box plots, there is significant visual evidence that the standard deviations are different. In addition, since the sample sizes are different, we know that the t-test is not robust to this assumption.
Independence: We will assume that these data are independent.
The two sample t-test is not appropriate here. We should look for a different test.

Mice Study
A large sample of mice were randomly assigned to receive a drug or a placebo (sample sizes nD = 32 and nP = 32). The mice's t-cell counts were then taken, and histograms and q-q plots are displayed on the slide. QOI: Is the mean t-cell count of mice that receive the drug greater than that of the mice that receive the placebo? Can we draw evidence of causality from this study? Address each assumption of the two sample t-test and then decide if the two-sample t-test is appropriate to answer this QOI with this data.
Normality of Distributions: Judging from the histograms and q-q plots, there is significant visual evidence to suggest the data come from right skewed distributions. However, since the sample sizes are large (nD = 32 and nP = 32), the t-test is robust to this assumption violation.
Equal Standard Deviations: There is strong visual evidence to suggest that the data come from distributions with different standard deviations. However, since we have the same sample size in each group, the t-test is robust to this assumption violation, by a previous "rule of thumb".
Independence: We will assume that these data are independent.
The two sample t-test is appropriate here.

Transformations

Log Transformation
Appropriate interpretations after a log transformation (example write-ups):
Observational study: "It is estimated that the median for population X is exp(mean(log(x)) − mean(log(y))) times as large as the median for population Y."
Randomized experiment: "It is estimated that the median response of an experimental unit to treatment x will be exp(mean(log(x)) − mean(log(y))) times as large as its response to treatment y."

Cloud Seeding!
Does cloud seeding work? On days that were deemed suitable for cloud seeding, a random mechanism was used to decide whether to seed the target cloud on that day or to leave it unseeded as a control. Precipitation was measured as the total rain volume falling from the cloud base following the airplane seeding run, as measured by radar. We would like to test at the alpha = .05 level of significance whether cloud seeding is effective in increasing precipitation.
[Slide figures: Figure 1, box plots of the original cloud seeding data (rainfall in acre-feet, roughly 0 to 2500, unseeded vs. seeded); Figure 2, box plots of the log-transformed cloud seeding data (log(rainfall), roughly 0 to 8).]
H0: Cloud seeding does not work (Median_seeded = Median_unseeded); H1: Cloud seeding does work (Median_seeded > Median_unseeded). After the log transformation, the t test and confidence interval are run on the logged data; back-transforming gives e^1.1438 = 3.1 for the estimated effect (one sided test) and e^0.3904 = 1.5, e^1.8972 = 6.7 for the confidence interval.
It is estimated that the median volume of rainfall on days when clouds were seeded was e^1.1438 = 3.1 times as large as when not seeded (p-value = .007). A 90% confidence interval for this multiplicative effect on the median is 1.5 to 6.7 times. Since randomization was used to determine whether any particular suitable day was seeded or not, it is safe to interpret this as evidence that the seeding caused the larger median rainfall.

Recap: The Take Away
What you will find in practice will most likely not fit exactly into the scenarios we identified here. There will be some judgment involved ... this is the "art" of statistics. Here are some general rules of thumb that we will assume this semester:
1. If sample sizes are the same and sufficiently large, the t tools (tests and confidence intervals) are valid, since they are robust to the violation of normality.
2. If the two populations have the same standard deviation, then the t tests are valid, given sufficient sample sizes.
3. If the standard deviations are different and the sample sizes are different, then the t tools are not valid and another procedure should be used (Ch. 4).

Appendix


Log Transformations: Theory
Prop 1: Because the logged data are (approximately) symmetric, the mean equals the median on the log scale: Mean[log(X)] = Median[log(X)], and likewise Mean[log(Y)] = Median[log(Y)].
Prop 2: log(Median(X)) = Median(log(X)). The logarithm is a monotonically increasing function: if X1 > X2 then log(X1) > log(X2). So for X1 < X2 < X3 < X4 < X5 we have log(X1) < log(X2) < log(X3) < log(X4) < log(X5), and therefore log(Median(X)) = log(X3) = Median(log(X)).
Prop 3: log(X) − log(Y) = log(X/Y).
Prop 4a: e^(log(X)) = X for natural logs, and 10^(log10(X)) = X for base 10 logs. (e is a pretty remarkable number!)
Prop 4b: e^(log(X) − log(Y)) = X/Y, and 10^(log10(X) − log10(Y)) = X/Y.

Derivation (shown for natural logs; the base 10 version is identical with 10 in place of e):
Mean(log X) − Mean(log Y) = δ        (difference of means on the log scale)
Median(log X) − Median(log Y) = δ    (Prop 1)
log(Median X) − log(Median Y) = δ    (Prop 2)
log(Median X / Median Y) = δ         (Prop 3)
Median X / Median Y = e^δ            (Prop 4)
So exponentiating the difference in means of the logged data estimates the ratio of the medians on the original scale.

Full Example: SSHA Data

The Survey of Study Habits and Attitudes (SSHA) is a psychological test designed
to measure the motivation, study habits, and attitudes toward learning of college
students. These factors, along with ability, are important to explain success in
school. Scores on the SSHA range from 0 to 200. A selective private college gives
the SSGA to an SRS of both male and female first-year students.
The data for the women are as follows:
156 109 137 115 152 140 154 178 111 123 126 126 137 165 129 200 150
140 116 120 130 131 130 140 142 117 118 145 130 145
The data for men are as follows:
118 140 114 180 115 126 92 169 139 121 132 75 88 113 151 70 115 187
114 116 117 145 149 150 120 121 117 129 92 110
Most studies have found that the mean SSHA score for men is lower than the mean
score in a comparable group of women. Test this claim at the alpha = .05 level of
significance. (Show all 6 steps.)

H0: w = m
H1: w > m

𝑋
log 𝑋 − log 𝑌 = log ( )
𝑌

Prop 3

FULL EXAMPLE: SSHA Data

Prop 3:

Mean[log(x)] = Median[log(x)]

71

State the Problem: We would like to test the claim that the mean SSHA score of men is less than that of women.

Check Assumptions:
1. Normally Distributed Populations
The q-q plots for both populations look sufficiently normal. We look at the histograms as well, but there is not sufficient evidence here to suggest that they are not normal. Keeping in mind the relatively small sample size from each population, we do not observe any extreme outliers, and we observe a pretty strong bell shape, which lends evidence to support normality of the populations. Visual inspection of the histograms and q-q plots of each population is consistent with normality, so we assume normality and move on to the second assumption.

2. Equal Standard Deviations
A visual check was done by looking at the histograms, which reveal similar shapes and support the equal variances assumption. Since we are able to assume normal population distributions, we can use the F-test as secondary evidence if the visual check is inconclusive. Since the p-value is greater than our significance level of alpha = 0.05, we fail to reject the null hypothesis of equal variances (p-value = 0.1043) and conclude that there is not enough evidence to suggest the variances are different.

3. Independent Observations
The sample was an SRS (simple random sample) from the population of the selective private college; therefore we assume the observations are independent of one another.

Run The Two Sample T-Test!!!
There is no reason to pair these observations and we have two samples; therefore we should use the two sample t-test with a pooled standard deviation, since we are assuming the population standard deviations are equal. We are testing:
H0: µW = µM
H1: µW > µM
Critical value: with α = .05 and df = 60 − 2 = 58, the critical value is t(.05, 58) = 1.67.

Let's Formalize This Test Into 6 Steps! (Two Sample T-Test, SAS Output)

We would like to test the claim that the mean SSHA score of the men is less than the mean score of the women. To do this we take samples of size nM = 30 and nW = 30 and find that x̄M = 124.2 points, x̄W = 137.1 points, sM = 27.2 points, and sW = 20.2 points.
Step 1: Identify the null (Ho) and alternative (Ha) hypotheses. Ho: µW − µM = 0, Ha: µW − µM > 0
Step 2: Draw and shade and find the critical value.
Step 3: Find the test statistic (the t value for the data): t = (x̄W − x̄M) / (sp · sqrt(1/nW + 1/nM)) = 2.08
Step 4: Find the p-value: p-value = .0211
Step 5: REJECT Ho.

Full Example: SSHA Data (Scope and Conclusion)

Conclusion: There is sufficient evidence to support the claim at the α = .05 level of significance (p-value = .0211) that the mean SSHA score is lower for men than for women at this college. A 95% one-sided confidence interval for this difference is (2.5238 points, ∞).

Scope of Inference: Since the study is between women and men, the subjects cannot be randomly assigned to the two groups, and we have an observational study. For this reason, we cannot make any causal inference and must limit our conclusions to differences of group means. However, the sample was an SRS, and thus any results can be inferred back to the population of students at this particular private college.

ANOTHER FULL EXAMPLE: Promotion Data

The Revenue Commissioners in Ireland conducted a contest for promotion. The ages of the unsuccessful and successful applicants are given below. Some of the applicants who were unsuccessful in getting the promotion charged that the competition involved discrimination based on age. Treat the data as samples from larger populations and use a .05 significance level to test the claim that the unsuccessful applicants are from a population with a greater mean age than the mean age of successful applicants. Based on the result, does there appear to be discrimination based on age? (Show all 6 steps.) Assume all data comes from a normally distributed population.

Unsuccessful Applicants: (ages listed in the table on the original slide)
Successful Applicants: (ages listed in the table on the original slide)

H0: µS = µU
H1: µS < µU

Full Example: Promotion Data
State the Problem: We would like to test the claim that the mean of the successful group is less than the mean of the unsuccessful group.

Check Assumptions:
1. Normally Distributed Populations
The q-q plot for the successful data provides some evidence of non-normality, while the q-q plot for the unsuccessful data looks consistent with normally distributed data. The successful group has a clear right skew, while the unsuccessful group shows a possible mild right skew, suggesting that both sets of data may be from right-skewed populations. We know that the t-tools are robust to non-normality for these types of distributions (when the sample size is sufficient), so we proceed with the t-test. We will readdress these concerns when we talk about the standard deviation.

2. Equal Standard Deviations
A visual check was done by looking at the histograms, which reveal similar shapes and support the equal variances assumption. As secondary evidence in case the visual check is inconclusive: since the p-value is greater than our significance level of alpha = 0.05, we fail to reject the null hypothesis of equal variances (p-value = 0.2286) and conclude that there is not enough evidence to suggest the variances are different.

3. Independent Observations
The data are treated as samples from the larger populations of successful and unsuccessful applicants, and we assume the observations are independent of one another.

Run The Two Sample T-Test!!!
There is no reason to pair these observations, and we have two samples. Therefore, we should use the two sample t-test with a pooled standard deviation, since we are assuming the population standard deviations are equal. We are testing:
H0: µS = µU
H1: µS < µU
From the SAS output, we fail to reject the null hypothesis at the 0.05 level.

Scope
Since the study is between successful and unsuccessful candidates for a promotion, subjects cannot be randomly assigned to the two groups, and we have an observational study. For this reason we cannot make any causal inference and must limit our conclusions to differences of group means. However, the sample was an SRS, and thus any results can be inferred back to candidates for promotion from the population that the Revenue Commissioners of Ireland sampled.

Conclusion
There is not sufficient evidence to support the claim at the α = .05 level of significance (p-value = .4357) that the mean age of those who were given a promotion is lower than that of those who were not given the promotion in this study. A 90% confidence interval for this difference is (−6.3 years, 5.2 years).

Part IV

Alternatives to the t tools


Chapter 15

Problem 2: Logging problem
We are doing rank sum analysis

15.1

Complete Rank-Sum Analysis Using SAS

Problem Statement
We would like to test the claim that logging burned trees increased the percentage of seedlings lost in the Biscuit
Fire region from 2004 to 2005.

Assumptions
Independence
The two-sample Wilcoxon rank-sum test assumes that the samples are independent. In this case, the two sets of tree plots are independent of each other: the number of seedlings in one plot is not directly related to the number of seedlings in another, and any dependence that does exist is negligible. Therefore, we can assume independence. We can also assume ordinality, since the data are numerical.

Statement of the Hypothesis
Our null hypothesis, H0 , is that the distribution of percent of saplings lost in the logged plots is less than or equal
to the distribution of percent of saplings lost in the unlogged plots. Our alternative hypothesis, H1 , is that the
distribution of percent of saplings lost in the logged plots is greater than the distribution of percent of saplings
lost in the unlogged plots. Mathematically speaking, we have:
H0: meanRank(logged) − meanRank(unlogged) ≤ 0    (15.1.1)
H1: meanRank(logged) − meanRank(unlogged) > 0    (15.1.2)

The significance level, α, is:
α = 0.05    (15.1.3)

Calculation of the P-value
To find the p value, I performed a Wilcoxon Rank-Sum test. Because the sample size is small, an exact test was used,
as there is no need for a normal approximation. The code used to perform the test is as follows:


Code 15.1. Exact rank sum test using SAS
/* We want the wilcoxon test and the Hodges-Lehman Confidence Interval*/
proc NPAR1WAY data=loggingData Wilcoxon HL;
class Action;
Var PercentLost;
/* Because our sample size is small, we want to do an Exact test*/
Exact;
run;

The output of this code is displayed in Figure 15.1.1:
Figure 15.1.1. Results of the Rank-Sum Test on the Logging Data

The calculated p value is

p = 0.0058    (15.1.4)

Results of the Hypothesis Test
We have that:

p = 0.0058 < α = .05    (15.1.5)

Therefore, we reject the null hypothesis. There is sufficient evidence at the α = 0.05 significance level (p-value = 0.0058 for the exact test) to suggest that the distribution of percentages of saplings lost in the logged plots was greater than the distribution of percentages of saplings lost in the unlogged plots.

Statistical Conclusion
The data provide convincing evidence that forest recovery is decreased in areas where burned trees were logged. At a significance level of .05 (or even .01), the median percentage of saplings lost in the logged plots was greater than that of the unlogged areas (one sided, exact rank-sum p-value of 0.0058). A range of plausible values (95% confidence interval) for how much greater the median loss of saplings was for the logged plots is [10.8, 65.1], as displayed in Figure 15.1.2.

Figure 15.1.2. 95% Confidence Interval

Note that the negatives of these values were taken, because this figure shows Unlogged − Logged.

Scope of Inference
The transect patterns within the plots were randomly selected, so we can generalize to all of the trees in the 16 larger plots and say that the areas which were logged had a greater loss of saplings and therefore recovered more poorly than the unlogged areas. However, the plots were not randomized to receive the logging or no-logging treatment, so this was not a randomized experiment and no causal inference can be made. That is, we cannot say that the logging of burnt trees caused the greater percent loss of saplings.

Confirmation Using R
In this section we confirm our findings using R. The R code input is shown below:
Code 15.2. Wilcoxon rank sum test using R
loggingData <- read.csv("Data/Logging.csv", header = TRUE, sep = ",")
wilcox.test(PercentLost ~ Action,
            data = loggingData,
            exact = TRUE,
            alternative = "greater")

And the output:

Wilcoxon rank sum test

data:  PercentLost by Action
W = 55, p-value = 0.005769
alternative hypothesis: true location shift is greater than 0

The results of the two programs are identical!
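If we also want the Hodges-Lehmann interval that SAS reported in Figure 15.1.2, the same R call can produce one. This is a minimal sketch assuming the loggingData frame from Code 15.2; a two-sided 95% interval is requested so it is comparable to the SAS output, and the sign of the estimate depends on which level of Action R treats as the first group.

wilcox.test(PercentLost ~ Action,
            data       = loggingData,
            exact      = TRUE,
            conf.int   = TRUE,    # returns the Hodges-Lehmann estimate and interval
            conf.level = 0.95)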


Chapter 16

Problem 3: Welch’s Two Sample T-Test with
Education Data
16.1

Problem Statement and Assumptions

Problem Statement
We would like to examine the claim that the mean income of college-educated people (16 years of education) is greater than the mean income of people with only a high school education (12 years of education).

Assumptions
The code used to produce everything in this section is shown below:
Code 16.1. welch’s t test
proc ttest data=edudata order=DATA
sides=U; /*an Upper tailed test*/
class Educ;
var Income2005;
run;

Normality
Figure 16.1.1 shows histograms and box plots of the data:

Figure 16.1.1. Histograms and Box plots

As we can see from the figure, the data are not normal; they are heavily right skewed in both cases. Both the histograms and the box plots show this: the histograms are much taller on the left side than on the right, and the box plots show the bulk of the data piled on the left with many outliers, which is clearly not normal. We examine this further with the Q-Q plot in Figure 16.1.2.
Figure 16.1.2. Q-Q Plot

The Q-Q plot confirms our finding that the data are not very normal. However, the sample sizes are 400 and 1000, which means that we can apply the central limit theorem: the sampling distributions of the means can be treated as approximately normal, so we will proceed as if the normality assumption is met.
Independence
We will assume independence in this case.

16.2

Complete Analysis Using SAS

Statement of Hypotheses
H0 :µ16yeareduc − µ12yeareduc ≤ 0

(16.2.1)

H1 :µ16yeareduc − µ12yeareduc > 0

(16.2.2)


Critical t Value
With α = .05 and a one sided test, the critical t value (with the appropriate degrees of freedom) is calculated using
the code shown below.
data critval;
p = quantile("T",.95,473.85); /*one sided test*/;
proc print data=critval;
run;
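The same quantile can be sanity-checked in R (a quick check, not part of the SAS workflow):

qt(0.95, df = 473.85)   # upper 5% critical value of the t distribution, about 1.648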

The critical t value is shown in Figure 16.2.1:
Figure 16.2.1. Critical t-value

The critical t value is t = 1.64. This is illustrated using the following bit of SAS code:
data pdf;
do x = -4 to 4 by .01;
pdf = pdf("T", x, 473.85);
lower = 0;
if x >= quantile("T",0.95,473.85) then upper = pdf;/*one sided*/
else upper = 0;
output;
end;
run;
title 'Shaded t distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x
lower = lower
upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;

This produces Figure 16.2.2.

Figure 16.2.2. Shaded t Distribution

Calculation of the t Statistic
To calculate Welch's t statistic, we use the code shown in Code 16.1, giving us a t value of t = 9.98, as seen in Figure 16.2.3.
Figure 16.2.3. Results of Welch’s t-test

We see that in this case, we have a t-value of 9.98
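As a quick check of the reported p-value, a sketch using only the t statistic and degrees of freedom quoted above:

pt(9.98, df = 473.85, lower.tail = FALSE)   # essentially zero, consistent with p < .0001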

Calculation of the p Value
We also see from Figure 16.2.3 that the p-value is reported as less than .0001.

Results of Hypothesis Test
We have that p < .0001 < α = .05, and therefore we reject the null hypothesis.

Conclusion
We have convincing evidence that the mean income of people with 16 years of education is greater than the mean income of people with 12 years of education (one sided p-value < .0001). The figure below shows a one sided 95% confidence interval for the difference in means:

Figure 16.2.4. Confidence Interval on the Difference of Means

The confidence interval on the difference of means is [27662.2, ∞). This describes the plausible values for the difference between the two population means. As we can see, the mean income of those with a 16-year education is estimated to be at least about $27,662 greater than the mean income of those with a 12-year education.

Scope of Inference
This was an observational study; therefore, we cannot conclude that the extra education caused the increase in mean incomes. Households were selected from a random sample of previously selected "areas of the United States," and the subjects in this study are the members of those households. Since every member of an "area" had the same chance of being selected, the data are a random sample of those "areas." However, no indication is given of how the "areas" themselves were selected. In conclusion, the association between education and income described above can be generalized to all the members of the "areas" that were selected for this study, but not to the U.S. as a whole.

Verification using R
The following R code was used to verify the analysis:

eduData <- read.csv("Data/EducationData.csv", header = TRUE, sep = ",")
t.test(Income2005 ~ Educ,
       data = eduData,
       # we use "less" because R is computing 12 - 16
       alternative = "less")

This gives the following output:

Welch Two Sample t-test

data:  Income2005 by Educ
t = -9.9827, df = 473.85, p-value < 2.2e-16
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -27662.19
sample estimates:
mean in group 12 mean in group 16
        36864.90         69996.97

Note that R is telling us that the mean income of the sample with a 12-year education is at least about $27,662 less than that of the sample with a 16-year education.

Preferences
I prefer the log-transformed analysis. Both approaches assume normality, but the log-transformed data are much closer to normal to begin with, and their variances are roughly equal. The log analysis also speaks to the medians rather than the means, which is far more robust to the large number of outliers. Because of those outliers, the mean is not a good measure of center here, so I prefer the log method.

Chapter 17

Problem 4: Trauma and Metabolic
Expenditure rank sum
17.1

Hand-Written Calculations

To summarize, T = 82, µ(T) = 56, and SD(T) = 8.632. The handwritten work was done before the author understood the continuity correction; the continuity-corrected Z and p values were calculated as follows:

Z = ((T − 0.5) − µ(T)) / SD(T) = (81.5 − 56) / 8.632 = 2.95    (17.1.1)
p = .001568    (17.1.2)

where the 0.5 subtracted from T is the continuity correction.
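The same arithmetic can be checked quickly in R, a sketch using only the summary values above:

T_stat <- 82; mu_T <- 56; sd_T <- 8.632
z <- ((T_stat - 0.5) - mu_T) / sd_T     # continuity-corrected z, about 2.95
pnorm(z, lower.tail = FALSE)            # one-sided p-value, about 0.0016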

17.2

SAS verification

To verify the Z and p values calculated in Section 17.1, the following SAS code was run:
proc NPAR1WAY data=TraumaStudy Wilcoxon HL;
class PatientType;
Var MetabolicEx;
run;

The results of this code are shown in Figure 17.2.1.
Figure 17.2.1. Continuity Corrected Wilcoxon Test Using SAS

The results of the two tests are the same! Note that if you add the option correct=no to the proc NPAR1WAY statement, you get the same values as the uncorrected ones in the handwritten work.
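The same comparison can also be run in R. This is a sketch that assumes a TraumaStudy data frame with the PatientType and MetabolicEx columns used in the SAS code above.

trauma    <- TraumaStudy$MetabolicEx[TraumaStudy$PatientType == "Trauma"]
nontrauma <- TraumaStudy$MetabolicEx[TraumaStudy$PatientType == "Nontrauma"]
wilcox.test(trauma, nontrauma,
            exact       = FALSE,     # normal approximation, as in the hand calculation
            correct     = TRUE,      # continuity correction
            alternative = "greater")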

17.3

Full Statistical Analysis

Problem Statement
We would like to test the claim that the trauma patients had higher metabolic expenditures than the non-trauma patients.

Assumptions
The Wilcoxon rank-sum test assumes only that the observations are independent (and that the response values are at least ordinal). We will assume independence here because the patients were not related to each other in any way; at the very least, one patient's metabolic expenditure does not depend on another patient's. (The data also happen to look roughly normal, but the rank-sum test does not require this.)

Hypothesis definitions
H0: meanRank(Trauma) − meanRank(Nontrauma) ≤ 0    (17.3.1)
H1: meanRank(Trauma) − meanRank(Nontrauma) > 0    (17.3.2)

In other words, the null hypothesis is that the nontrauma and trauma patients have equal distributions of metabolic expenditures, while the alternative hypothesis claims that the distribution of the trauma patients' metabolic expenditures is higher. We are using a one sided hypothesis test because that is what the book calls for. In this scenario, we will say α = 0.05.

Critical Value
The critical value was calculated using the following chunk of SAS code:
data critval;
p = quantile("Normal",.95); /*one sided test*/;
proc print data=critval;
run;

This produces a critical z value of z = 1.64485.
Figure 17.3.1. Critical Value

The critical value is shown on a normal distribution using the following bit of SAS code
data pdf;
do x = -4 to 4 by .01;
pdf = pdf("Normal", x);
lower = 0;
if x >= quantile("Normal",0.95) then upper = pdf;/*one sided*/
else upper = 0;
output;
end;
run;
title 'Shaded Normal distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x
lower = lower
upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;

The shaded distribution is displayed in Figure 17.3.2.

Figure 17.3.2. Shaded Normal Distribution

Calculation of the z statistic
Our z statistic, calculated in Sections 17.1 and 17.2, is 2.95.

Calculation of the p value
Our p-value, calculated in Sections 17.1 and 17.2, is 0.0016.

Discussion of the hypothesis
We reject the null hypothesis, since p = .0016 < 0.05 = α.

Conclusion
We have convincing evidence that the distribution of metabolic expenditure of trauma patients is greater than that of the nontrauma patients (p = 0.0016 on a one sided Wilcoxon rank-sum test). The figure below shows a 95% Hodges-Lehmann confidence interval on the shift between the two distributions:
Figure 17.3.3. 95% Confidence Interval

This tells us that a plausible shift between the two distributions (a difference in medians) is between 1.9 and 16.7. As we can see, this interval does not include zero, which is consistent with rejecting the null hypothesis that the difference is less than or equal to zero. This analysis cannot give us causal or population inferences, because the study was neither a randomized experiment nor a random sample.

Chapter 18

Problem 5: Autism and Yoga signed rank
18.1

Hand-Written Calculations

The results of the calculations are as follows: S = 41, µS = 22.5, and SDS = 8.4409. The Z value on the paper is incorrect, as it does not correct for continuity, so here we will apply the continuity correction:

z = (S − 0.5 − µS) / SDS    (18.1.1)
z = (40.5 − 22.5) / 8.4409 = 2.13  →  p(one tail) = .0166,  p(two tail) = .033    (18.1.2)

18.2

Verification in SAS and R

Verification in SAS
To verify this, the following bit of SAS code was employed, producing the output shown in Figure 18.2.1:
Code 18.1. Signed Rank test in SAS
data Autismdiff;
set Autism;
diff= Before-After;
run;
proc univariate data=Autismdiff;
var diff;
run;

Figure 18.2.1. Signed Rank Test In SAS

This two sided p value of 0.0313 corresponds to a one sided p value of .01565 and a z value of roughly 2.15. It differs slightly from my calculation because SAS did not use a normal approximation, while I did.

Verification in R
This R code was employed for the same purposes:

AutismData <- read.csv("Data/Autism.csv", header = TRUE, sep = ",")
wilcox.test(AutismData$Before, AutismData$After,
            paired = TRUE,
            alternative = "greater",
            conf.int = TRUE)

Yielding:

Wilcoxon signed rank test with continuity correction

data:  AutismData$Before and AutismData$After
V = 41, p-value = 0.01618
alternative hypothesis: true location shift is greater than 0
95 percent confidence interval:
 4.999993       Inf
sample estimates:
(pseudo)median
      17.49993

The R code applied a normal approximation with a continuity correction instead of the exact permutation test that SAS used. Its p-value corresponds to a Z score of about 2.14.
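The hand calculation in Section 18.1 can also be reproduced directly in R, a small sketch using only the summary values quoted there:

S <- 41; mu_S <- 22.5; sd_S <- 8.4409
z <- (S - 0.5 - mu_S) / sd_S        # continuity-corrected z from (18.1.2), about 2.13
pnorm(z, lower.tail = FALSE)        # one-sided p-value, about 0.017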

18.3

6 step Sign Rank test using SAS

Statement of Hypothesis
H0: Median(Before) − Median(After) ≤ 0    (18.3.1)
H1: Median(Before) − Median(After) > 0    (18.3.2)

We will say that α = .05 and we are doing a one sided test.

Critical Values
The critical value was calculated using the following chunk of SAS code:
data critval;
p = quantile("Normal",.95); /*one sided test*/;
proc print data=critval;
run;

This produces a critical z value of z = 1.64485.
Figure 18.3.1. Critical Value

The critical value is shown on a normal distribution using the following bit of SAS code
data pdf;
do x = -4 to 4 by .01;
pdf = pdf("Normal", x);
lower = 0;
if x >= quantile("Normal",0.95) then upper = pdf;/*one sided*/
else upper = 0;
output;
end;
run;
title 'Shaded Normal distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x
lower = lower
upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;

The shaded distribution is displayed in Figure 18.3.2.
Figure 18.3.2. Shaded Normal Distribution

Calculation of a Z statistic
We will use the Z statistic calculated by hand (and matched by R), Z = 2.13; using SAS's exact version instead would not have a large effect on the outcome of the test.

Calculation of a p value
For our z value, a one sided p value is p = 0.016.

Assessment of hypothesis
p = .016 < α = .05 →We reject the null hypothesis.

Conclusion
We have convincing evidence that the median time to complete the puzzle for the autistic children is greater before 20 minutes of yoga than after 20 minutes of yoga. We cannot infer causality because this was not a randomized experiment, and we cannot infer anything about the population because this was not a random sample. The median time for the children was at least 5 seconds longer before yoga than after yoga, as seen in the confidence interval displayed in the R output.

18.4

Paired t test in SAS

Statement of Hypothesis
H0: µ(Before−After) ≤ 0    (18.4.1)
H1: µ(Before−After) > 0    (18.4.2)

We will say that α = .05 and we are doing a one sided test.

Critical Values
The critical value was calculated using the following chunk of SAS code:
data critval;
p = quantile("T",.95,8); /*one sided test*/;
proc print data=critval;
run;

With the following output:
Figure 18.4.1. Critical Value

With a critical t value of t=1.86. This is demonstrated in a shaded t distribution with the following chunk of code:
data pdf;
do x = -4 to 4 by .01;
pdf = pdf("T", x,8);
lower = 0;
if x >= quantile("T",0.95,8) then upper = pdf;/*one sided*/
else upper = 0;
output;
end;
run;


title 'Shaded Normal distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x
lower = lower
upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;

The shaded distribution is displayed in Figure 18.4.2.
Figure 18.4.2. Shaded T Distribution

Calculation of a t statistic
The t statistic was calculated using the following SAS code; the resulting t value is shown in Figure 18.4.3.
Code 18.2. Paired T test in SAS
proc ttest data=Autism alpha = .05 sides=U;
paired Before*After;
run;

Figure 18.4.3. Paired t statistic

We have a t value of 2.54.
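As a quick arithmetic check (using only the t value and the df = 8 used for the critical value above), the one-sided p-value can be recovered in R:

pt(2.54, df = 8, lower.tail = FALSE)   # about 0.017, consistent with p = .0173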

Calculation of a P value
The p value can be seen in Figure 18.4.3: p = .0173

Assessment of Hypothesis
p = .0173 < α = .05 → we reject the null hypothesis.

Conclusion
We have convincing evidence that the mean of the differences of times before and after the yoga is greater than zero (p = .0173 on a one sided paired t test). A confidence interval for the mean of the differences in time for the children to finish the puzzle before and after yoga is shown in Figure 18.4.4:

Figure 18.4.4. 95% Confidence interval

This means that the mean of the differences was at least 4.9 seconds. We cannot infer causality because this was not a randomized experiment, and we cannot make inferences about the population because this was not a random sample; using a paired t-test does not change either of these limitations.

18.5

Confirmation with R

The R code below was used to verify the results of the previous section:
t.test(AutismData$Before, AutismData$After,
       paired = TRUE,
       alternative = "greater",
       conf.int = TRUE)

The output is presented below:

Paired t-test

data:  AutismData$Before and AutismData$After
t = 2.5403, df = 8, p-value = 0.01735
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 4.913201      Inf
sample estimates:
mean of the differences
               18.33333

18.6

Complete Statistical Analysis

In this section I will use a paired t-test, because the differences are close to normally distributed, as we will see in the following section. When both tests are possible, I believe the paired t-test is preferable because it does not transform or rank the data, so the magnitudes of the differences remain directly interpretable.

Assumptions
We can assume the differences are independent because the children did not affect the other children.
To check for normality we examine the following figure:


Figure 18.6.1. Histogram and Box Plot

As we see from Figure 18.6.1, the differences are fairly normally distributed. The histogram is heavier in the center than on the edges, and the mean is near the median on the box plot. We will examine this further in Figure 18.6.2.
Figure 18.6.2. Q-Q Plot

As we can see, the data follows the line of normality closely, and therefore we can assume normality. This means
that a paired t test is appropriate.

Statement of Hypothesis
H0: µ(Before−After) ≤ 0    (18.6.1)
H1: µ(Before−After) > 0    (18.6.2)

We will say that α = .05 and we are doing a one sided test.

Critical Values
The critical value was calculated using the following chunk of SAS code:


data critval;
p = quantile("T",.95,8); /*one sided test*/;
proc print data=critval;
run;

With the following output:
Figure 18.6.3. Critical Value

With a critical t value of t=1.86. This is demonstrated in a shaded t distribution with the following chunk of code:
data pdf;
do x = -4 to 4 by .01;
pdf = pdf("T", x,8);
lower = 0;
if x >= quantile("T",0.95,8) then upper = pdf;/*one sided*/
else upper = 0;
output;
end;
run;
title 'Shaded Normal distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x
lower = lower
upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;

The shaded distribution is displayed in Figure 18.6.4.
Figure 18.6.4. Shaded T Distribution

Calculation of a t statistic
The T statistic was calculated using the following SAS code:
proc ttest data=Autism alpha = .05 sides=U;
paired Before*After;
run;

The t value is shown in Figure 18.6.5.

Figure 18.6.5. Paired t statistic

We have a t value of 2.54.

Calculation of a P value
The p value can be seen in Figure 18.6.5: p = .0173

Assessment of Hypothesis
p = .0173 < α = .05 → we reject the null hypothesis.

Conclusion
We have convincing evidence that the mean of the differences of times before and after the yoga is greater than zero (p = .0173 on a one sided paired t test). A confidence interval for the mean of the differences in time for the children to finish the puzzle before and after yoga is shown in Figure 18.6.6:
Figure 18.6.6. 95% Confidence interval

This means that the mean of the differences was at least 4.9 seconds. We cannot infer causality because this was not a randomized experiment, and we cannot make inferences about the population because this was not a random sample; using a paired t-test does not change either of these limitations.

Chapter 19

sexy ranked permutation test
Here is the SAS code I designed to conduct a ranked permutation test.
Code 19.1. handcrafted rank sum test
proc import
datafile='c:\Users\david\Desktop\MSDS\MSDS6371\Homework\Week4\Data\Trauma.csv'
out=TraumaStudy
DBMS=CSV;
run;
proc rank data=TraumaStudy out=Ranked ties=mean;
var MetabolicEx;
ranks rank;
run;
proc print data=Ranked;
run;
proc iml;
use Ranked var {PatientType rank};
/*making two groups in IML*/
read all var {rank} where(PatientType='Nontrauma') into g2;
read all var {rank} where(PatientType='Trauma') into g1;
obsdiff = sum(g1) - sum(g2);
print obsdiff;
call randseed(12345);                            /* set random number seed */
alldata = g1 // g2;                              /* stack data in a single vector */
N1 = nrow(g1); N = N1 + nrow(g2);
NRepl = 5000;                                    /* number of permutations */
nulldist = j(NRepl,1);                           /* allocate vector to hold results */
do k = 1 to NRepl;
x = sample(alldata, N, "WOR");                   /* permute the data */
nulldist[k] = sum(x[1:N1]) - sum(x[(N1+1):N]);   /* difference of sums */
end;

title "Histogram of Null Distribution";
refline = "refline " + char(obsdiff) + " / axis=x lineattrs=(color=red);";
call Histogram(nulldist) other=refline;

pval = (1 + sum((nulldist) >= (obsdiff))) / (NRepl+1); /* one sided test: no absolute value */
print pval;
quit;

I did not have time to add a normal curve to my figure; however, the p-value is more or less the same as the Wilcoxon test's, and it is arguably a more reasonable number.
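For readers who prefer R, here is a minimal sketch of the same idea. It assumes a TraumaStudy data frame with the PatientType and MetabolicEx columns used in the SAS code above; it is an illustration, not a line-for-line translation of the IML code.

set.seed(12345)
r        <- rank(TraumaStudy$MetabolicEx)                  # average ranks for ties
obsdiff  <- sum(r[TraumaStudy$PatientType == "Trauma"]) -
            sum(r[TraumaStudy$PatientType == "Nontrauma"])
n1       <- sum(TraumaStudy$PatientType == "Trauma")
nulldist <- replicate(5000, {
  shuffled <- sample(r)                                    # permute the ranks
  sum(shuffled[1:n1]) - sum(shuffled[-(1:n1)])
})
pval <- (1 + sum(nulldist >= obsdiff)) / (5000 + 1)        # one sided p-value
pval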


Figure 19.0.1. Permutation Test


Chapter 20

Unit 4 lecture slides
Here it is

Alternatives to (Student) t-Tools: Rank Sum Test, Welch's Test, Sign Test / Signed Rank Test

Let's Start With an Example
• IBM gives each employee in the marketing department technical training.
• Based on further testing, it appears the traditional training method isn't effective.
• Hence, a new training method is developed.
• Below are the test scores of 4 individuals who just finished the "New Method" and the last 3 test scores from employees trained via the "Traditional Method" course.
• Is there evidence to suggest that the "New Method" increases test scores?

New Method: 37, 49, 55, 77
Traditional Method: 23, 31, 46

Examining the t-Tools Assumptions
Which situation does it appear we are in? Since the standard deviations appear (on a visual check) to be different and the sample sizes are both different and exceptionally small, the t-test was not deemed appropriate and the nonparametric rank sum test was performed. Using a t-test here could have low power.

Nonparametric Methods
• A NONPARAMETRIC or DISTRIBUTION-FREE test doesn't depend on underlying distributional assumptions.
• This makes them ideal for use when the assumptions of parametric tests aren't met.
• The trade-off is that nonparametric methods perform somewhat worse than parametric methods if the parametric assumptions are approximately correct.
• The first nonparametric method we will consider is the rank sum test.

Rank Sum Test: Advantages
• No distributional assumptions
• Resistant to outliers
• Performs nearly as well as the t-test when the two populations are normal, and considerably better when there are extreme outliers
• Works well with ORDINAL (as opposed to interval) data
• Works with censored values
• It still requires some assumptions:
1. All observations are independent
2. The Y values are ordinal

The Hypothesis Test (two sided and one sided versions)
Example: 59 patients with arthritis who participated in a clinical trial were assigned to two groups, active and placebo. The response status of each patient was recorded on an ordinal scale (excellent=5, good=4, moderate=3, fair=2, poor=1).

The Rank Sum Statistic and Its Sampling Distribution
We can compute the rank sum test statistic using the following steps:
1. List all observations from both groups in increasing order (n is the total number of observations).
2. Assign each observation a rank, from 1 to n. If there are any ties, assign each tied observation the average of the tied ranks.
3. Identify each observation by its group.
4. The test statistic, T, is the sum of the ranks in one of the groups.
The rank sum statistic (the sum of ranks of one group) is approximately normally distributed, and we can find a p-value in two ways:
• Normal approximation
• Re-randomization (exact or approximate)
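To make these steps concrete, here is a small R sketch applying them to the IBM training scores from the first slide; the p-values it produces match the exact (.0571) and normal-approximation (.0558) values quoted on the slides.

new  <- c(37, 49, 55, 77)
trad <- c(23, 31, 46)
r    <- rank(c(new, trad))     # ranks 1..7; ties (none here) would get average ranks
T_new <- sum(r[1:4])           # sum of ranks in the "New Method" group: 21
# Built-in checks: exact one-sided p-value, then the normal approximation with
# continuity correction.
wilcox.test(new, trad, alternative = "greater")
wilcox.test(new, trad, alternative = "greater", exact = FALSE, correct = TRUE)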

Rank Sum Test: Randomly Assign Ranks
(The slide shows several example re-orderings of the seven employees' ranks and group labels.)
Record the sum of ranks of one group (e.g. "Trad.") for all 7! permutations of ranks (7! = 7*6*5*4*3*2*1 = 5040). The p-value is the number of permutations with a sum equal to or more extreme than the one in the original data set, divided by the total number of permutations.
*Could also do an approximate p-value by randomly choosing, say, 1000 orderings of the data.

Rank-Sum Test: Normal Approximation
Common interpretation:
H0: The distribution of New Method scores = the distribution of the Traditional Method scores
H1: The distribution of New Method scores > the distribution of the Traditional Method scores
Technical mathematical interpretation:
H0: Average rank of New Method scores = average rank of all scores (a constant)
H1: Average rank of New Method scores > average rank of all scores (a constant)
There is mild evidence (alpha = 0.1) to suggest that the distribution of scores from the "New" method is greater than the distribution of the "Traditional" method (normal approximation to rank-sum test p-value = 0.0558).

Permutation Test (Exact P-value)
There is sufficient evidence at the alpha = 0.1 level of significance (p-value = .0571 for the exact test) to suggest that the distribution of scores from the four IBM employees that were given the New Method is greater than the distribution of the 3 employees that took the test having had the Traditional Method of instruction.

Cognitive Load Experiment
• Researchers compared the effectiveness of conventional textbook examples to modified ones.
• They selected 28 ninth-year students who had no previous exposure to coordinate geometry.
• The students were randomly assigned to one of two self-study instructional groups, using conventional and modified instructional materials.
• After instruction, they were given a test and the time to complete one of the problems was recorded (some response times were censored).
Is there sufficient evidence to suggest that the cognitive load theory (modified instruction) shortened response times?

Cognitive Load Experiment: Normal Approximation
With ties, the ranks are averaged, and a continuity correction is used.
Ho: The distributions of Modified and Conventional scores are equal.
Ha: The distribution of Modified scores is less than that of Conventional.
Critical Value (left sided): -1.645 (alpha = .05)
Test Statistic: z-stat = -3.0183
P-value (left sided) = .0013
Reject Ho.

Confidence Interval for the Location Parameter (Median): the Hodges-Lehmann Confidence Interval
https://en.wikipedia.org/wiki/Hodges%E2%80%93Lehmann_estimator
*We will look at an example later.

Statistical Conclusion: The data provide convincing evidence that a student could solve the problem more quickly after the "modified" rather than the "conventional" method (one-sided, normal approximation w/ C.C. p-value = 0.0013, from the rank-sum test). A range of plausible values for how much smaller the "modified" distribution is than the "traditional" (treatment effect) is [-158, -59] sec. (95% confidence interval based on a rank-sum test), with a point estimate of 108.5 sec.

Welch's t-Test (Creativity Study: Reminder)
The pooled t-test assumes the two populations have equal standard deviations. What if this assumption isn't true? Welch's t-test is the version of the t-test used when it is not.

Gender Income Discrimination
Strong evidence against normality, but the CLT applies. Strong evidence against equal standard deviations, together with different sample sizes. (The sample sizes are close, but the standard deviations appear to be so different that this may make a real difference.) We will assume independence. Student's t-test is not a good idea here.
Test Statistic: tstat = -3.88
P-value = .0006
Reject H0
Conclusion: There is strong evidence to suggest that the mean income of the female group is different from the mean income of the male group (p-value = .0006). A 95% confidence interval for this difference is ($29,124, $94,176) in favor of the males. That is quite a difference!

Rank Sum versus Welch's … the Take Away
If you wish to make inference on the difference of means and you have the sample size to invoke the CLT, Welch's t-test is preferred by most statisticians, and it is robust to different standard deviations even when the sample sizes are not equal.
Often, especially in skewed distributions, the median is a better measure of center. For this reason, one may prefer the rank sum test even when Welch's t-test is available.
If you have small sample sizes, you may not be very confident about the normality assumption even if the histograms and q-q plots look okay. For this reason, one may wish to be "conservative" and run the rank sum test and obtain inference on the median.
If there are outliers or censored values, the rank sum test is often the most appropriate, as the t-test is not resistant to outliers and has no way of using censored data.

Performance of Welch's t-test (simulation results shown on the slide).

Paired T-Test
Known alternatively as the Matched Pairs or Dependent t-test.
Assumptions:
• Data are either from one sample that has been tested twice (example: pre- and post-test, or repeated measures), or from a group of subjects that are thought to be similar and can thus be matched or paired (example: from the same family, or twins).
• Differences are normally distributed and independent of one another (though the two measurements within a pair are dependent).

A Look at the Variance
If data can be paired, the variance can be reduced.

Example: Keith's Medical Reasoning Test
• The AMA has a diagnostic test for medical reasoning.
• On average, people score about 500 points on this test.
• We have data from 10 subjects who took the medical reasoning test. These subjects were randomly selected from St. Paul Hospital in Dallas.
• Not fatigued: the baseline, taking the test before a shift.
• Fatigued: after the treatment; working for 12 operational hours prior to re-taking the test.
(Lower numbers = worse score.)

Subject  Not Fatigued  Fatigued
1        567           530
2        512           492
3        509           510
4        593           580
5        588           600
6        491           483
7        520           512
8        588           575
9        529           530
10       508           490

We can try to test whether the DIFFERENCE OF THE MEANS between the fatigued scores and the not fatigued scores is less than zero. If we did this, we would be wrong! Why? A fundamental assumption is violated: independence. We need to account for the dependence between the two groups.

Paired t-test reduces to a one-sample t-test
Instead of testing the DIFFERENCE OF THE MEANS, we should test the MEAN OF THE DIFFERENCES, di = Fatigued − Not Fatigued:

Subject  Fatigued  Not Fatigued  Difference
1        530       567           -37
2        492       512           -20
3        510       509             1
4        580       593           -13
5        600       588            12
6        483       491            -8
7        512       520            -8
8        575       588           -13
9        530       529             1
10       490       508           -18

H0: µd = 0
Ha: µd < 0

A SAS Code Comparison
Using paired data (when appropriate) instead of unpaired data allows us to tighten the confidence interval for the difference in means AND increase the power (the likelihood that our data properly detects a shift in score), compared to the two (independent) sample t-test.

Checking the Assumptions
There is little to no evidence that the differences do not come from a normal distribution. We will assume that the differences are independent. Is this a reasonable assumption?

Additional Information
• We can look at a PROFILE PLOT.
• The lines connect the scores on the MRT in the "fatigued" versus "not fatigued" states.
• This plot is standard for SAS proc ttest with paired data.

Appendix

Conclusion (alpha = 0.01)
Critical Value: t(0.01, 9) = -2.821
Test Statistic: tstat = -2.41
P-value = 0.0196 > 0.01
Fail to Reject Ho.
Statistical Conclusion: There is not enough evidence to suggest that, on average, the fatigued subjects score lower than the non-fatigued subjects (p-value = .0196). A 99% one-sided confidence interval for the mean difference in scores is (-infinity, 1.76). Perhaps a more meaningful confidence interval would be a two-sided 98% confidence interval of (-22.36, 1.76).
Scope of Inference: Since this was a random sample from St. Paul Hospital in Dallas, we can infer that this result would be repeated for any group selected from this hospital. There is no way to guarantee a causal inference from a paired t-test.
Note: The elusiveness of the causal inference comes from the fact that the treatment that induces fatigue may itself be confounded. Some may work for 12 hours as a surgeon and others may work 12 hours writing reports. There is reason to believe that if a difference is detected, this difference may not be due to fatigue but rather may be due to the type of work.

Alternatives to the t-Test for Paired Data

Example: Nerve Data
For each of the 9 horses, a veterinary anatomist measured the density of nerve cells at specified sites in the intestine.

horse  site1  site2
6      14.2   16.4
4      17     19
8      37.4   37.6
5      11.2   6.6
7      24.2   14.4
9      35.2   24.4
3      35.2   23.2
1      50.6   38
2      39.2   18.6

Using the paired t-test, the hypothesis test can be stated two sided or one sided. The sample size is rather small, hence the normality assumption is somewhat suspect.

Sign Test: Horse Data (one sided, continuity-corrected p-value)

horse  site1  site2  diff   Sign
8      37.4   37.6   -0.2   -
4      17     19     -2     -
6      14.2   16.4   -2.2   -
5      11.2   6.6     4.6   +
7      24.2   14.4    9.8   +
9      35.2   24.4   10.8   +
3      35.2   23.2   12     +
1      50.6   38     12.6   +
2      39.2   18.6   20.6   +

K = 6 (number of positive differences)
Critical Value (right sided): z0.05 = 1.645
Test statistic: z = 0.666
P-value (one sided) = .2527
Fail to Reject H0.

Statistical Conclusion: There is not enough evidence that the median nerve density at site 1 is greater than the median nerve density at site 2 (sign test one-sided p-value of 0.2527).

Signed Rank Test: Horse Data (one sided, continuity-corrected p-value)

horse  site1  site2  diff   abs(diff)  Sign  rank
8      37.4   37.6   -0.2   0.2        -     1
4      17     19     -2     2          -     2
6      14.2   16.4   -2.2   2.2        -     3
5      11.2   6.6     4.6   4.6        +     4
7      24.2   14.4    9.8   9.8        +     5
9      35.2   24.4   10.8   10.8       +     6
3      35.2   23.2   12     12         +     7
1      50.6   38     12.6   12.6       +     8
2      39.2   18.6   20.6   20.6       +     9

S = 39 (sum of the ranks of the positive differences)
Critical Value (right sided): z0.05 = 1.645
Test statistic: z = 1.89
P-value (one sided) = .0294
Reject Ho.

Statistical Conclusion: There is strong evidence that the median nerve density at site 1 is greater than the median nerve density at site 2 (Wilcoxon signed rank test one-sided p-value of 0.0294).
Notes:
• The signed-rank test has more power than the sign test (compare the p-values 0.2527 vs. 0.0294).
• Both tests make very few assumptions about the distributions.
• The SAS output for the horse data reports two-sided p-values; half of each is close to our calculated one-sided p-values from earlier.
• For n < 20, SAS uses probabilities from the binomial distribution rather than the normal approximation. These are more accurate (exact), and we should use them when SAS is available.

Part V

ANOVA


Chapter 21

Problem 1: Plots and Logged Data
We begin our work looking at raw and transformed data.

21.1

Plots and Transformations

Raw Data Analysis
First, we will look at the raw data. To check if the raw data fits the assumptions, we will first look at a scatter plot.
The scatter plot of the raw data was produced by the following bit of SAS code:
Code 21.1. Scatterplot of Raw Data Using SAS

proc sgplot data=EduData;
scatter x=educ y=Income2005;
run;

This results in the plot shown in Figure 21.1.1:

Figure 21.1.1. Scatter Plot of the Raw Data

Looking at Figure 21.1.1, we see that the raw data are very heavy between 0 and 20,000 for all categories, but some groups spread further and wider than others, which suggests the variances may not be equal. The heaviness of the lower end of each group may also suggest a lack of normality. We will examine this further with some box plots, produced using the following chunk of SAS code; the result is shown in Figure 21.1.2:
Code 21.2. Boxplot of Raw Data Using SAS
proc sgplot data=EduData;
vbox Income2005 / category=educ
dataskin=matte
;
xaxis display=(noline noticks);
yaxis display=(noline noticks) grid;
run;


Figure 21.1.2. Box Plot of the Raw Data

Figure 21.1.2 tells us a lot about our data. We see from the size and shape of the boxes that the variances of our data are by no means homogeneous. Note also that there are a lot of outliers while the distribution is heavily weighted towards the bottom, which suggests our data may have departed from normality. We will examine this phenomenon further using histograms. To produce histograms of the raw data, the following SAS code was used; the result is shown in Figure 21.1.3:
Code 21.3. Histogram of Raw Data Using SAS
proc sgpanel data=EduData;
panelby educ / rows=5 layout=rowlattice;
histogram Income2005;
run;


Figure 21.1.3. Histogram of the Raw Data

Figure 21.1.3 confirms our suspicions: the variances of the data are likely unequal, and, more importantly, the data are clearly skewed to the right. We will confirm this using Q-Q plots.
To produce Q-Q plots of the raw data, the following SAS code was used:
Code 21.4. Q-Q of Raw Data Using SAS
/* Normal = blom produces normal quantiles from the data */
/* To find out more, look at the SAS documentation!*/
proc rank data=EduData normal=blom out=EduQuant;
var Income2005;
/* Here we produce the normal quantiles!*/
ranks Edu_Quant;
run;
proc sgpanel data=EduQuant;
panelby educ;
scatter x=Edu_Quant y=Income2005 ;
colaxis label="Normal Quantiles";
run;

This results in the following plot:


Figure 21.1.4. Q-Q Plot of the Raw Data

The Q-Q plots in Figure 21.1.4 tell us what we already know: the raw data are not normal and do not have equal variances. The ANOVA test is not very robust to highly skewed, long-tailed data, and it depends heavily on the equal-variance assumption, so we cannot use the raw data as they are.

Transformed Data Analysis
Now we will perform a log transformation on the data and see if that helps it meet our assumptions better. To do
a log transformation, we will employ the following SAS code:
Code 21.5. Logging of Raw Data Using SAS
data LogEduData;
set EduData;
LogIncome=log(Income2005);
run;

We will begin our analysis of the transformed data with a scatter plot, produced with the following SAS code; the result is shown in Figure 21.1.5:
Code 21.6. Scatterplot of Logged Data Using SAS
proc sgplot data=LogEduData;
scatter x=educ y=LogIncome;
run;

Figure 21.1.5. Scatter Plot of the Log-Transformed Data

As we can see in Figure 21.1.5, the groups have a much more similar spread, suggesting similar variances, and the dense part of each scatter is closer to the center, in between the outliers, which tells us the log transformation may have done a good deal towards normalizing our data. We can examine this further using box plots. To produce box plots of the transformed data, the following SAS code was used; this gives us the plot in Figure 21.1.6:

Code 21.7. Boxplot of Logged Data Using SAS
proc sgplot data=LogEduData;
vbox LogIncome / category=educ
dataskin=matte
;
xaxis display=(noline noticks);
yaxis display=(noline noticks ) grid;
run;

Figure 21.1.6. Box Plot of the Log-Transformed Data

Figure 21.1.6 gives us some useful information about our data. We see the boxes and whiskers are of similar
size, which tells us the variances are likely homogeneous. Furthermore, the medians and means are near each
other, and the boxes are near the center of the distribution, which suggests that the data may be normal. We will
examine these two phenomena further with histograms. To produce histograms of the log-transformed data, the following SAS code was used; the result is shown in Figure 21.1.7:
Code 21.8. Histogram of Logged Data Using SAS
proc sgpanel data=LogEduData;
panelby educ / rows=5 layout=rowlattice;
histogram LogIncome;
run;


Figure 21.1.7. Histogram of the Log-Transformed Data

From the spread of the histograms in Figure 21.1.7, we see two things. First, the similar width of the histograms
confirms that variances are roughly equal. Second, the shape of the histograms, and their location near the center
suggests that the data is very nearly normal. We will further examine the normality of the data using Q-Q plots.
To produce the Q-Q plots of the transformed data, the following SAS code was used; the result is shown in Figure 21.1.8:
Code 21.9. Q-Q of Logged Data Using SAS
proc rank data=LogEduData normal=blom out= LogEduQuant;
var LogIncome;
ranks LogEduQuant;
run;
proc sgpanel data=LogEduQuant;
panelby educ;
scatter x=LogEduQuant y=LogIncome ;
colaxis label="Normal Quantiles";
run;

This results in the following plot:


Figure 21.1.8. Q-Q Plot of the Log-Transformed Data

Examining Figure 21.1.8, we see a confirmation of our beliefs: the log-transformed data, when plotted against
normal quantiles, is fairly normal. This means that, with the log-transformed data, we can reasonably assume
normality and homogeneity of variance.

21.2 Complete Analysis

We will now perform a complete analysis of our data, using Pure ANOVA.

Problem Statement
We would like to determine whether or not at least one of the five population distributions (corresponding to
different years of education) is different from the rest.

Assumptions
As seen in Section 21.1, the raw data does not meet the assumption of normality nor of homogeneity of variance.
However, in Section 21.1, we proved that after a log transformation, the data does meet both of these assumptions.
The ANOVA test is fairly robust to the slight departure from normality presented by the log transformed data, and
the variances are equal. The data is clearly independent, so that assumption is met. Therefore, all assumptions of
ANOVA are met by the log transformed data.

Hypothesis Definition
In this problem, our Null (Reduced Model) Hypothesis, H0 , is that all the groups have the same distribution and our
Alternative (Full Model) Hypothesis, H1 is that the distributions are different. Mathematically, that is written as:
H0 : median_grand   median_grand   median_grand   median_grand   median_grand   (Equal-Medians / Reduced Model)    (21.2.1)
H1 : median_<12   median_12   median_13−15   median_16   median_>16   (Separate-Medians / Full Model)    (21.2.2)

We will consider our significance level, α, to be 0.05.


F Statistic
To conduct this hypothesis test, the following SAS code was used; the resulting ANOVA output is shown in Figure 21.2.1.
Code 21.10. ANOVA Test Using SAS

proc glm data = LogEduData;
class educ;
model LogIncome = educ;
run;

Figure 21.2.1. ANOVA Table

Figure 21.2.1 tells us what our F statistic is. We see that

F = 62.87    (21.2.3)

P-value
Figure 21.2.1 also tells us our p-value. In this case,

p < .0001    (21.2.4)

Hypothesis Assessment
In this scenario, we have that p < .0001 < α = .05 and therefore we reject the null hypothesis.

Conclusion
There is substantial evidence (p < 0.0001) that at least one of the distributions is different from the others. To further
examine this, we will see how the distribution changes between adjacent levels of schooling. We will compare <12 and 12
years of school, 12 and 13-15 years of school, 13-15 and 16 years of school, and 16 and >16 years of school. To
do this, we will compare the group medians of the logged data, using the following SAS code:
Code 21.11. Comparison of distributions using SAS
proc sort data=LogEduData;
by educ;
run;
proc means data=LogEduData median order=data;
by educ;
var LogIncome;
run;


Table 21.1. Comparison of Logged Means

Education   µ
<12         9.9
12          10.22
13-15       10.39
16          10.79
>16         10.89

From Table 21.1, we can calculate the differences of the means for our log transformed groups, and see how
much the distributions differ, shown in the following table:
Table 21.2. Comparison of Distributions

Pair            Difference   Multiplicative Effect (e^(µ1−µ2))   % Increase
<12 and 12      0.32         1.38                                38
12 and 13-15    0.17         1.19                                19
13-15 and 16    0.4          1.49                                49
16 and >16      0.1          1.11                                11

Table 21.2 shows, for each pair, how many times greater the typical income of the higher education level
is than that of the lower education level.
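To make the back-transformation concrete, here is the arithmetic for the first row of Table 21.2 (the other rows are computed the same way): e^0.32 ≈ 1.38, so the typical income of the 12-year group is estimated to be about 1.38 times that of the <12 group, i.e. roughly 38% higher.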

Scope of Inference
As this was a random sample, we can make inferences about the population; however, we cannot make causal
inferences, as this was not a randomized experiment. That means we can say that, in general, people with X years
of education make Y times as much as people with Z years of education, but we cannot say it is due to the education
itself.

21.3 Extra Values

The extra values were produced with the same code as in Section 28.1. They can be found in Figure 21.2.1, and in
the figure below:
Figure 21.3.1. Extra Values

Value of R2
Figure 21.3.1 tells us R2 is 0.0888

Mean Square Error and Degrees of Freedom
The Error Sum of Squares, shown in Figure 21.2.1, is 2232.12 on 2579 degrees of freedom, which gives a Mean Square Error of approximately 0.87.
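As a quick arithmetic check against the R output in Code 21.12 below: R² = 217.7 / (217.7 + 2232.1) ≈ 0.089, and MSE = 2232.1 / 2579 ≈ 0.87.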

ANOVA in R!
Here is the R code and output to do ANOVA in R on the log transformed data:

Code 21.12. ANOVA in R
# #################### Anova in R ######################
edudata <- read.csv(file='data/ex0525.csv', header=TRUE, sep=",")
edudata$logincome <- log(edudata$Income2005)

# http://www.sthda.com/english/wiki/one-way-anova-test-in-r
anovatest <- aov(logincome ~ Educ, data=edudata)
summary(anovatest)

# ######################## Results #####################

              Df Sum Sq Mean Sq F value Pr(>F)
Educ           4  217.7   54.41   62.87 <2e-16 ***
Residuals   2579 2232.1    0.87


Chapter 22

Problem 2: Build Your Own Anova!
In this section we will be building an ANOVA table to determine whether or not the distribution of income of people
with >16 years of education is different from the distribution of income of people with exactly 16 years of education. To build this
ANOVA table, we need two preliminary ANOVA analyses. The first is the ANOVA analysis seen in Section 21.2, which
has the null hypothesis that all the distributions are the same and the alternative hypothesis that the distributions
differ. Next, we build a second ANOVA table, which has a null hypothesis that all the distributions are the
same, and an alternative hypothesis that all the distributions are different except that the group with 16 years and the
group with >16 years are still the same. This is done by grouping those two into one group, with the following SAS
code:
Code 22.1. Regrouping data using SAS
data EduGroupData;
set LogEduData;
Others = educ;
if educ eq "16" or educ eq ">16" then Others = "a";
run;

Next, to obtain the residual sum of squares and degrees of freedom for this reduced model, an ANOVA test is conducted on the grouped, logged data with the following bit of code; this produces the intermediate ANOVA table shown in Figure 22.0.1:
Code 22.2. Secondary ANOVA using SAS
proc glm data = EduGroupData;
class Others;
model LogIncome = Others;
run;

Figure 22.0.1. Grouped ANOVA Table

22.1 Building the Extra Sum of Squares ANOVA Table

Using the output in Figure 22.0.1 and the output in Figure 21.2.1, we can make our own ANOVA table, which has a null hypothesis
that all the distributions are different except 16 and >16, which are the same, and an alternative hypothesis
that all the distributions are different. Since both hypotheses make the same prediction about the data for <12, 12,
and 13-15, the null hypothesis of our custom-made ANOVA table is effectively that 16 and >16 have the same distribution,

and the alternative is that they have different distributions. We will now construct our new, extra sum of squares
ANOVA table.
First, for our full model (the ”Error” row in the ANOVA table), we will use the full model (alternative hypothesis,
or the ”Error” row), from Figure 21.2.1. This represents our alternative hypothesis, where the distribution of 16
and >16 are different. Next, we will construct our reduced model (The ”Total” row in the ANOVA table) using the
full model (alternative hypothesis, or the ”Error”) from 22.0.1. This represents our null hypothesis, where 16 and
>16 have the same distribution. To generate our Model, or Extra Sum of Squares, which will allow us to find our F
statistic and p value, we need to take a couple of steps. To determine the number of degrees of freedom of our
model, we subtract the number of degrees of freedom from the Error row from the number of degrees of freedom
of the Total row. To calculate the extra sum of squares, we subtract the residual sum of squares of the full model
(error) from the residual sum of squares of the reduced model (total). Then, to find the mean square, we divide
the extra sum of squares by the number of degrees of freedom in our model. Our F statistic is then produced by
normalizing the Extra Sum of Squares, dividing it by the Mean Square Error (in the Error row). To get a p value from
the F statistic, we examine an F distribution with df_model numerator degrees of freedom and df_full denominator degrees
of freedom. The results of these computations are displayed in the following table:
Table 22.1. Homemade ANOVA Table

Source             DF     Sum of Squares   Mean Square   F Value   Pr > F
Model (Extra SS)      1             1.98          1.98       2.3    0.129
Error (Full)       2579          2232.12           .86
Total (Reduced)    2580           2234.1
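For readers who prefer R, the same extra-sum-of-squares F test can be reproduced by fitting the full and reduced models and comparing them with anova(). This is only a sketch: it reuses the edudata data frame from Code 21.12, and the level labels "16" and ">16" are assumptions about how the Educ column is coded in the csv.

# Sketch of the extra-sum-of-squares F test in R (assumes edudata from Code 21.12;
# the Educ level labels "16" and ">16" are assumptions about the data's coding).
edudata$Others <- ifelse(edudata$Educ %in% c("16", ">16"),
                         "16 or more", as.character(edudata$Educ))

full    <- lm(logincome ~ Educ,   data = edudata)  # separate mean for every group
reduced <- lm(logincome ~ Others, data = edudata)  # 16 and >16 forced to share one group

# anova() on two nested fits computes (SS_reduced - SS_full) / df_extra divided by the
# full model's MSE, which should match the F ≈ 2.3 and p ≈ 0.13 in the homemade table.
anova(reduced, full)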

22.2 Complete Analysis

Problem Statement
We would like to determine whether or not people with a college degree or a graduate degree have different
distributions of incomes.

Assumptions
There are three assumptions of ANOVA: normality, homogeneity of variance, and independence. We have shown,
in Section 21.1 that while the raw data does not meet the first two assumptions, the log transformed data does.
Both the transformed and raw data meet the assumption of independence. We will proceed with our ANOVA test.

Hypothesis Definition
Our null hypothesis states that the distributions of the >16 and 16 groups are the same, and our alternative hypothesis
states that the distributions of the >16 and 16 groups are different. As explained in Section 22.1, this is written
mathematically as:
H0 : median_<12   median_12   median_13−15   median_16,>16   median_16,>16    (22.2.1)
H1 : median_<12   median_12   median_13−15   median_16   median_>16           (22.2.2)

or, equivalently:

H0 : median_16 = median_>16    (22.2.3)
H1 : median_16 ≠ median_>16    (22.2.4)

We will consider our significance level, α, to be 0.05.


F Statistic
The F statistic is calculated with the following equation:

F = (SS_extra / DF_extra) / σ̂²_full = (SS_extra / DF_extra) / MSE    (22.2.5)

The result of this calculation can be seen in Table 22.1: F = 2.3. This is a small F statistic, which suggests only weak
evidence against the null hypothesis.
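Plugging the Table 22.1 values into Equation (22.2.5) as a quick check: F = (1.98 / 1) / (2232.12 / 2579) ≈ 1.98 / 0.866 ≈ 2.3.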

P-value
The P value is calculated using F, the Extra degrees of freedom, and the Full (Error) degrees of freedom. Using the
values calculated in Table 22.1, we have that p = 0.129

Hypothesis Assessment
At a significance level of α = 0.05, we have that p = 0.129 > α = 0.05. Therefore, we fail to reject the null hypothesis.

Conclusion
There is not enough evidence to suggest that the distribution of income of people with a college only (16 years) is
different from the distribution of income of people with a postgraduate education (>16 years).

Scope of Inference
It is not strictly necessary to write a scope of inference since we did not reject the null hypothesis; however, this is a random
sample, so we can make inferences about the population as a whole, but we cannot infer causality, as this was not a
randomized experiment.

22.3 Degrees of Freedom and Comparison to T-Test

This test had 2579 error degrees of freedom (as seen in Table 22.1). This is far more than a two-sample t test comparing
only the 16 and >16 groups would have, because the pooled standard deviation in the ANOVA is estimated from all five
groups rather than just two. Therefore, this ANOVA test has more power than the t test.


Chapter 23

Problem 3: Nonhomogeneous Standard Deviations

23.1 Complete Analysis

Problem Statement
We would like to determine whether or not at least one of the five population distributions (corresponding to
different years of education) is different from the rest.

Assumptions
As seen in Section 21.1, the raw data does not meet the assumption of normality nor of homogeneity of variance.
However, in Section 21.1 we showed that after a log transformation the data is at least approximately normal. The ANOVA test is
fairly robust to the slight departure from normality presented by the log transformed data, so we can safely assume
normality. However, in this problem we cannot assume homogeneity of variance, so pure ANOVA is not appropriate. Since
the data is reasonably normal, we should prefer a parametric test, as parametric tests generally have more power
than their nonparametric analogs; the Kruskal-Wallis test is therefore not the most appropriate choice. We will instead
use Welch's ANOVA test, which assumes normality but does not assume homogeneity of variance, on the log
transformed data. We can assume the data is independent.

Hypothesis Definition
In this problem, our Null (Reduced Model) Hypothesis, H0 , is that all the groups have the same distribution and our
Alternative (Full Model) Hypothesis, H1 is that the distributions are different. Mathematically, that is written as:
H0 : median_grand   median_grand   median_grand   median_grand   median_grand   (Equal-Medians / Reduced Model)    (23.1.1)
H1 : median_<12   median_12   median_13−15   median_16   median_>16   (Separate-Medians / Full Model)    (23.1.2)

We will consider our significance level, α, to be 0.05.

F Statistic
To conduct this hypothesis test, the following SAS code was used; the resulting Welch's ANOVA table is shown in Figure 23.1.1.
Code 23.1. Welch’s ANOVA in SAS
proc glm data = LogEduData;
class educ;
model LogIncome = educ;
means educ / welch;
run;


Figure 23.1.1. Welch’s ANOVA Table

From Figure 23.1.1, we have that F = 56.59. This is a large F statistic, which suggests strong evidence against the
null hypothesis.

P-value
Figure 23.1.1 also gives the p-value associated with this F statistic, which is p < 0.0001.

Hypothesis Assessment
We have that p < 0.0001 < α = .05, and therefore we reject the null hypothesis.

Conclusion
There is convincing evidence (p < 0.0001) that at least one of the distributions is different from the others.

Scope of Inference
As this was a random sample, we can make inferences about the population; however, we cannot make causal
inferences, as this was not a randomized experiment. That means we can say that, in general, people with X years
of education make Y times as much as people with Z years of education, but we cannot say it is due to the education
itself.
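For reference, Welch's ANOVA can also be run in R with the base function oneway.test(). This is only a sketch reusing the edudata and logincome objects from Code 21.12; the Educ column name is taken from that code.

# Welch's ANOVA in R: var.equal = FALSE requests the Welch correction,
# so the group variances are not assumed to be equal.
oneway.test(logincome ~ Educ, data = edudata, var.equal = FALSE)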


Chapter 24

Unit 5 Lecture Slides
More slides


UNIT 5: Chapter 5

ANOVA
1. Make a Scatterplot of the data in the table below. "Level" is the Explanatory Variable (X=1, 2, or 3).

          Level i=1   Level i=2   Level i=3
Y1|X=i    3           10          20
Y2|X=i    5           12          22
Y3|X=i    7           14          24

ANOVA
2. Find the Grand Mean … this is the mean of the sample means. If the sample size is the same in each group, then this is the mean of all the Ys together … regardless of Level.

             Level i=1   Level i=2   Level i=3
Y1|X=i       3           10          20
Y2|X=i       5           12          22
Y3|X=i       7           14          24
Group mean   5           12          22

(Grand mean = 13.)

Pure ANOVA
4. Now we need to find the Sum of the Squared Residuals for the Equal Means Model (each observation compared to the grand mean of 13):

          Level i=1        Level i=2       Level i=3
Y1|X=i    (3−13)² = 100    (10−13)² = 9    (20−13)² = 49
Y2|X=i    (5−13)² = 64     (12−13)² = 1    (22−13)² = 81
Y3|X=i    (7−13)² = 36     (14−13)² = 1    (24−13)² = 121

Squared residuals for the Separate Means Model (each observation compared to its own group mean of 5, 12, or 22):

          Level i=1      Level i=2       Level i=3
Y1|X=i    (3−5)² = 4     (10−12)² = 4    (20−22)² = 4
Y2|X=i    (5−5)² = 0     (12−12)² = 0    (22−22)² = 0
Y3|X=i    (7−5)² = 4     (14−12)² = 4    (24−22)² = 4

6. Compare the Total Sum of Squares for each model. Which do you think "fits" better?


Pure ANOVA: Sum of Squares in ANOVA
[Slide graphs of the data table above: between-group variation (top row), within-group variation (middle row), total variation (bottom row).]

*To compute the sum of squares column for the ANOVA table, square each distance (lines in black) and then add. The sum of squared distances (black lines) for the left two graphs = the sum of squared distances (black lines) for the right graph.
*Each distance squared for the top-left graph is multiplied by the number in each group.

7. Now we would like to make an ANOVA table to test the alternative hypothesis! Formally write the Ho and Ha and fill in the table.

                              df   SS   MS   F   Pr > F
Model / Extra SS
Error / Residual/Full Model
Total (Reduced)

Extra Sum of Squares = Residual Sum of Squares Reduced − Residual Sum of Squares Full

Pure ANOVA
7. Now we would like to make an ANOVA table to test the alternative hypothesis! Formally write the Ho and Ha and fill in the table.
Ho: µ1 = µ2 = µ3                   (Equal Means Model µ µ µ)
Ha: At least 1 pair are different  (Separate Means Model µ1 µ2 µ3)

                              df          SS               MS   F   Pr > F
Model / Extra SS              8 − 6 = 2   462 − 24 = 438
Error / Residual/Full Model   6           24               4
Total (Reduced)               8           462

Extra Sum of Squares = Residual Sum of Squares Reduced − Residual Sum of Squares Full

Pure ANOVA
7. Now we would like to make an ANOVA table to test the alternative hypothesis! Formally write the Ho and Ha and fill in the table.
Ho: µ1 = µ2 = µ3                   (Equal Means Model µ µ µ)
Ha: At least 1 pair are different  (Separate Means Model µ1 µ2 µ3)

                              df   SS    MS            F               Pr > F
Model / Extra SS              2    438   438/2 = 219   219/4 = 54.75
Error / Residual/Full Model   6    24    4
Total (Reduced)               8    462

Extra Sum of Squares = Residual Sum of Squares Reduced − Residual Sum of Squares Full


F -Test of Different Means …

Pure ANOVA

Ho: µ1= µ2 = µ3
Ha: At least 1 pair are different

7. Now we would like to make an ANOVA table to test the alternative hypothesis!

(Equal Means Model)
(Separate Means Model)

Formally write the Ho and Ha and fill in the table.
Ho : µ 1 = µ 2 = µ 3
(Equal Means Model µ µ µ)
Ha: At least 1 pair are different
(Separate Means Model µ1 µ2 µ3)

                              df   SS    MS    F       Pr > F
Model / Extra SS              2    438   219   54.75   .0001
Error / Residual/Full Model   6    24    4
Total (Reduced)               8    462

Extra Sum of Squares = Residual Sum of Squares Reduced – Residual Sum of Squares Full
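The p-value on this slide can be checked directly in R from the F statistic and its degrees of freedom; this is just a one-line sketch of that check.

# Upper-tail area of an F(2, 6) distribution at F = 54.75
pf(54.75, df1 = 2, df2 = 6, lower.tail = FALSE)   # ≈ 0.00014, reported as .0001 on the slide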

6 Steps for ANOVA F Test (diff means)!
1. Ho: µ1 = µ2 = µ3 (Equal Means Model); Ha: At least 1 pair are different (Separate Means Model)
2. Critical value: You can skip this step for ANOVA.
3. F statistic = 54.75
4. P-value = .0001
5. Reject Ho.
6. The evidence suggests that at least 1 pair of the group means are different. (P-value < .0001 from an ANOVA.)

R-Squared!
R = correlation coefficient; R² = coefficient of determination

F-Distribution

Coefficient of Variation


ANOVA: Assumptions and Robustness
1. Normality: Similar to t-tools hypothesis testing,
ANOVA is robust to this assumption. Extremely longtailed distributions (outliers) or skewed distributions,
coupled with different sample sizes (especially when
the sample sizes are small) present the only serious
distributional problems.
2. Equal Standard Deviations: This assumption is crucial,
paramount, and VERY important.
3. The assumptions of independence within and across
groups are critical. If lacking, different analysis should
be attempted.

More on Constant SD

Samples drawn from
Normal Distributions
• Same visual checks as with t-tools, just for
more groups.
– Histograms
– Q-Q plots

Levene’s Test (Median)

Ho: σ1= σ2
Ha: σ1≠ σ2

95% confidence interval accuracy with different sample
sizes and standard deviations for three groups.

But … proc ttest does not have Levene’s Test!!!

Proc GLM Has Levene’s Test

Check of Assumptions: Constant SD

There is some visual evidence against
equal standard deviations. The BrownForsythe test was used as secondary
evidence and does not provide
significant evidence against equal
standard deviations. (p-value = .2558)


Archeology in New Mexico
An archeological dig in New Mexico yielded four
sites with lots of artifacts. The depth (cm) that each
artifact was found was recorded along with which
site it was found in.
The researcher has reason to believe that sites 1
and 4 and sites 2 and 3 may be similar in age. In
theory, the deeper the find, the older the village.
Is there any evidence that sites 1 and 4 have a
mean depth that is different than the mean depth
of artifacts from sites 2 and 3?

Archeology Example
Assumptions: Normality

Archaeology Example: the recorded depths (cm) by site

Site 1: 93, 120, 65, 105, 115, 82, 99, 87, 100, 90, 78, 95, 93, 88, 110
Site 2: 85, 45, 80, 28, 75, 70, 65, 55, 50, 40
Site 3: 100, 75, 65, 40, 73, 65, 50, 30, 45, 50, 45, 55
Site 4: 96, 58, 95, 90, 65, 80, 85, 95, 82

Archeology Example
Assumptions: Homogeneity (Equal SD)

Histograms will be helpful as well!

Archeology Example
Assumption: Independence
The discovered artifacts associated with the
depths were randomly selected from the log
(book of recordings … not logarithms!) of
discoveries.
Since the artifacts and, thus, the depths are
associated with completely different sites, it is
assumed that the data are independent
between sites.

Question of Interest:
1. Are any of the means different?
2. Are the means of sites 1 and 4 different?
3. Are the means of sites 2 and 3 different?
4. Satisfactory results of questions 1 and 2 will allow us to ask
the third question: are sites 1 and 4 different than 2 and 3?


Are sites 1 and 4 different from 2 and 3? *Assumes ANOVA assumptions are met

Stop:
Insufficient
evidence
that any
means are
different

Perform regular ANOVA to
test if any of the means are
different from the rest.
Reduced Model Ho: µ µ µ µ
Full Model Ha: µ1 µ2 µ3 µ4

BYO ANOVA to test if the
means of 2 and 3 are different,
given at least one pair is
different.
Reduced Model Ho: µ1 µ0 µ0 µ4
Full Model Ha: µ1 µ2 µ3 µ4

Reject Ho in
favor of Ha:
µ1 µ2 µ3 µ4?

Reject Ho in
favor of Ha:
µ1 µ2 µ3 µ4?

no

BYO ANOVA to test if the
means of 1 and 4 are different,
given at least one pair is
different.
Reduced Model Ho: µ0 µ2 µ3 µ0
Full Model Ha: µ1 µ2 µ3 µ4

yes

yes

no

yes

Stop:
Groups 1
and 4 are
different
and should
not be
treated as
having the
same
means, as
the QoI
suggests.

Stop:
Groups 2
and 3 are
different
and should
not be
treated as
having the
same
means, as
the QoI
suggests.

Stop:
Evidence
does NOT
support the
claim in QoI

Reject Ho in
favor of Ha:
µa µb µb µa ?

no

Question of Interest:
2. Are the means of sites 1 and 4 different?

*Recode the variables into three groups (2, 3, and 1/4 combined) and perform ANOVA to get the first table; in that run the recoded model µo µ2 µ3 µo plays the role of the full model against the equal means model µ µ µ µ.

The BYOA comparison of interest is then:
(Ho) Reduced Model: µo µ2 µ3 µo
(Ha) Full Model:    µ1 µ2 µ3 µ4

Source                  DF   SS        MS      F      Pr>F
Model (Full)            1    780.3     780.3   2.86   .098
Error (From Full)       42   11464.6   273.0
Total (From Reduced*)   43   12244.9

There is not enough evidence to suggest (alpha = .05, p-value = .098) that site 1 and site 4 have different mean depths.

Question of Interest: (try it!)
3. Are the means of sites 2 and 3 different?
*Recode the
variables into
three groups:
1, 4, and 2/3
combined and
perform
ANOVA to get
the first table.

(Ho) Reduced Model: µ1 µo µo µ4
(Ha)Full Model:
µ 1 µ 2 µ3 µ4
(Ho) Reduced: µ µ µ µ
(Ha) Full*: µ1 µo µo µ4

Source

DF SS

(Ho) Reduced: µ µ µ µ
(Ha) Full: µ1 µ2 µ3 µ4

MS

Model (Full)
Error (From Full)

42 11464.6

Total (From Reduced)

43 11477.7

273

F

Pr>F

There is evidence to suggest that at the alpha = .05 level of significance (pvalue < .0001) that at least 2 of the sites have different mean depths.

Stop:
Evidence
does
support the
claim in QoI

yes

no

o

(Ho) Reduced Model: µ µ µ µ
(Ha) Full Model: µ1 µ2 µ3 µ4

Perform ANOVA to test if the means of 1 and 4,
when taken together are different than means
2 and 3, when also taken together.
Reduced Model Ho: µ µ µ µ
Full Model Ha: µa µb µb µa

Reject Ho in
favor of Ha:
µ1 µ2 µ3 µ4?

a

First Ask: Is there reason to believe any
of them are different?

The reduced and
full models are
associated with
Ho and Ha,
respectively,
although they
are not exactly
equal to the
hypotheses.

Question of Interest: (try it!)
3. Are the means of sites 2 and 3 different?
*Recode the
variables into
three groups: 1,
1 o o 4
o
4, and 2/3
combined and
1 2 3 4
a
perform ANOVA
(Ho) Reduced: µ µ µ µ
(Ho) Reduced: µ µ µ µ
to get the first
table.
(Ha) Full: µ1 µ2 µ3 µ4
(Ha) Full*: µ1 µo µo µ4

(H ) Reduced Model: µ µ µ µ
(H ) Full Model: µ µ µ µ

Source

DF SS

MS

F

Pr>F

Model (Full)
Error (From Full)
Total (From Reduced*)

Question of Interest: (try it!)
3. Are the means of sites 2 and 3 different?

*Recode the variables into three groups (1, 4, and 2/3 combined) and perform ANOVA to get the first table; in that run the recoded model µ1 µo µo µ4 plays the role of the full model against the equal means model µ µ µ µ.

The BYOA comparison of interest is then:
(Ho) Reduced Model: µ1 µo µo µ4
(Ha) Full Model:    µ1 µ2 µ3 µ4

Source                 DF   SS        MS     F      Pr>F
Model (Full)           1    13.1      13.1   .048   .828
Error (From Full)      42   11464.6   273
Total (From Reduced)   43   11477.7

There is not enough evidence to suggest (alpha = .05, p-value = .828) that site 2 and site 3 have different mean depths.


Question of Interest:
4. Are sites 1 and 4 different than 2 and 3?
*Recode the
variables into two
groups 1/4 and 2/3
and perform ANOVA
to get the table.

A Small Example

(Ho) Reduced: µ µ µ µ
(Ha) Full: µb µa µa µb

There is sufficient evidence to suggest (alpha = .05,
p-value < .0001) that sites 1 and 4 have different
mean depths than sites 2 and 3.

Normality Assumption

Homogeneity of Variance Assumption

There is some (weak) evidence in
support of these data coming from
distributions with different standard
deviations. If the standard deviation
assumption and normality
assumption are both violated, what
should we do?
There is strong evidence against these data
coming from a normal distribution and the
sample size is small. ANOVA? WELCH’S ANOVA?

So …. NONPARAMETRIC!!!!

Kruskal-Wallis Test

There is not sufficient evidence at the alpha = .05 level of significance (p-value =
.3766 from Kruskal-Wallis Test) to suggest that at least two of the medians are
different.
Notice that each test failed to reject their respective Ho. The point isn’t so much
that one test will reject when the other will fail to reject. We must remember
that as statisticians, we don’t personally favor one outcome over the other. We
just want the appropriate test: the one with the most power. Kruskal-Wallis Test is
the appropriate test here.
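As a reference for the nonparametric route chosen on this slide, a Kruskal-Wallis test is a single call in R; the data frame and variable names below are placeholders, not the slide's actual data.

# Kruskal-Wallis rank sum test: 'response' and 'group' are placeholder names.
kruskal.test(response ~ group, data = smallExample)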


Another Analysis!!!!

Normality Assumption

…

There is strong evidence in
favor of these data coming
from a normal distribution.
We will proceed under this
assumption.

Assumptions and Analysis:

There is strong evidence in support of these data
coming from distributions with different standard
deviations. We will proceed under this
assumption and run the Welch’s ANOVA.

Regular ANOVA:

There is sufficient evidence at the alpha = .05 level of
significance (p-value = .0201 from Welch’s ANOVA) to
suggest that at least two of the means are different.
However, remember caveat to any different SD’s
approach.

Fixed or random effects

Fixed Effects vs. Random Effects
Quick answer:
• Do your groupings exhaust the data (e.g., data on
four different machines and there are only four
machines)? Fixed Effects! Use Proc GLM in SAS.
• Are your groupings a random sample of a larger
population that could have been chosen to be a
group (e.g., data on four different machines that
were chosen from a random sample of 100
machines)? Random Effects! Use Proc Mixed in
SAS.

APPENDIX

Measured the amount of liquid in twenty randomly selected cans of
Coke and twenty randomly selected cans of Diet Coke at a regional
bottling company. Coke and Diet Coke are bottled using different types
of machines.

Scenario 1: There is only one machine of each type.
Fixed Effects
Scenario 2: There are several of each type of machine.
The Coke samples all came from the same Coke
bottling machine, and the Diet Coke samples all came
from the same Diet Coke machine.

Random effects


MSE vs. Variance in each group

Examples

Another example!
5 different sports were analyzed to see if the average height of basketball
players was greater than the average of all the other sports. We could, of
course, compare each pairwise grouping of sports, but that would result in
4 tests. This would take a lot of time, and those tests would each have less
power since they don’t use all the data. Let’s use ANOVA similarly to how
we did in prior problems.
1. Make a side by side box plot of the data.
2. Run a basic ANOVA to test for any pairwise difference of means.
Check the assumptions here, but no need to address them after this.
3. Test the model that keeps basketball by itself but groups the other
sports as “others.”
4. Use the previous two models to conduct an extra sum of squares FTest:
Ho: Reduced Model: µB µO µO µO µO
Ha: Full Model: µB µF µSoc µSwim µT
5. Depending on the results of this test, test to see if there is evidence
that basketball has a different mean than each of the sports.
(Equivalent to testing basketball versus the others.)
Ho: Reduced Model: µO µO µO µO µO
Ha: Full Model:    µB µO µO µO µO

6. Make sure and provide written conclusions for questions 2,3,4 and 5.


First … Plot the Data!

Plot the Data cont.

Normality: We have very small sample sizes here. There is not a lot of evidence against
normality for each group, although there is not a lot of evidence to begin with. We will
proceed with caution under the assumption of normal distributions for each sport.
Homogeneity of Variance: Judging from the box plots, there is some visual evidence
against equal standard deviations, although the sample size is still small. A secondary
test would be nice to lean on here.
We will assume the observations are independent both between and within groups.

Brown and Forsythe Test for Equality
of Variance.

1 Way ANOVA
Ho: µBasketball = µFootball= µSoccer = µSwim = µTennis
Ha: At least one pair of means is different.

There is some visual evidence against equal standard deviations between
sports. The Brown and Forsythe test was used as secondary evidence and
does not provide significant evidence against equal standard deviations. (pvalue = .9672)

There is strong evidence to suggest that the at least one of the sports has a mean height
that is different than the others (p-value < .0001 from an ANOVA).

Ho: µBasketball = µFootball= µSoccer = µSwim = µTennis
Ha: At least one pair of means are different.

F-TEST
Ho: The Others are equal. (Including Basketball)

Same Test as last slide ….
F-TEST
Different Notation
Ho: Reduced Model: µB µO µO µO µO

Ho: Reduced Model: µ µ µ µ µ
Ha: Full Model: µB µF µSoc µSwim µT

Ha: The Others are different (Including Basketball)

Ho: µBasketball = µFootball= µSoccer = µSwim = µTennis
Ha: µBasketball is different than the Others.

Fail to Reject Ho
There is not sufficient evidence at
the alpha = .05 level of significance
(p-value = 0.5375) to suggest that
the mean heights of non-basketball
sports are not equal. Therefore we
will proceed as if they are equal.

Ha: Full Model: µB µF µSoc µSwim µT

Ho: Reduced Model: µ µ µ µ µ
Ha: Full Model: µB µO µO µO µO

Fail to Reject Ho
There is not sufficient evidence at
the alpha = .05 level of significance
(p-value = 0.5375) to suggest that
the mean heights of non-basketball
sports are not equal. Therefore we
will proceed as if they are equal.


µB µO µO µO µO

µB µF µSoc µSwim µT

Ho: µBasketball = µOthers
Ha: µBasketball ≠ µOthers

F-TEST: Another Look
Ho: Reduced Model: µB µO µO

µO

µO

Ha: Full Model: µB µF µSoc µSwim µT
Source            DF   SS       MS     F     Pr > F
Model             3    11.63    3.87   .74   0.5375
Error             27   141.56   5.24
Corrected Total   30   153.19

Since we are proceeding under the assumption that the mean heights of the other sports (besides basketball) are equal, we can test whether basketball has a mean height different than the other sports by testing the hypotheses stated above.

There is strong evidence at the alpha = .05 level of significance (p-value < .0001) that supports the claim that the mean height of basketball players is different than that of the other 4 sports.

Resources
www.itl.nist.gov/div898/handbook/prc/section4/prc433.htm

Spock Example

Spock Trial

The Raw Data

• 1968: Dr. Ben Spock was accused of conspiracy to violate the
Selective Service Act by encouraging young men to resist being
drafted into military service for Vietnam.
• Jury Selection: A “venire” of 30 potential jurors is selected at
random from a list of 300 names that were previously selected at
random from citizens of Boston.
• A jury is then selected NOT at random by the attorneys trying the
case.
• For this case, the venire consisted of only one woman, who was let
go by the prosecution, thus resulting in an all male jury.
• There was reason to believe that women were more sympathetic to
Dr. Spock’s actions due to his popular child rearing books.
• The defense argued that the judge in this case had a history of
venires that underrepresented women, which is contrary to the law.
• Let’s see if there is any evidence for this claim!


Comparing Two Means
From Many Groups.
Judge   N   Xbar   Sd
Spock   9   14.6   5.04
A       5   34.1   11.94
B       6   33.6   6.58
C       9   29.1   4.59
D       2   27.0   3.81
E       6   27.0   9.01
F       9   26.8   5.97

Ho: µS = µF
Ha: µS ≠ µF

Question: Suppose we wish to test if the "S" judge's venires are different from the "F" judge's.

Spock Data Steps

Two Judge Analysis w/ t-Tools (with 2 groups estimating the pooled SD), from PROC TTEST:
Estimated Diff    = -12.1778
Sp                = 5.5234
Pooled Std. Error = 2.6038
t-Statistic       = -4.68
Deg. of freedom   = 16

Two Judge Analysis w/ Several Groups (with all 7 groups estimating the pooled SD: bigger 'n', greater df, more POWER!!!):
sp = 6.91
Deg. of freedom = 46 − 7 = 39
P-value = .0006

Statistical Conclusion: We find that there is substantial evidence that the difference in the mean percentage of females on judge S and judge F venires is not equal to zero.

Two Judge Analysis:
Conclusion
Question: Suppose we wish to test
if the “S” judge’s venires are
different from the “F” judge’s.

Spock Trial QOI 2
The defense argued that the judge in this case had a history of venires that
underrepresented women, which is contrary to the law.

Answer: There is evidence that
the mean of the two groups is
different.

QOI2: Is the percent of women on recent venires of Spock’s judge
(which we will call S) significantly lower than those of 6 other judges
(which we notate A to F)?
• There are two key questions:
•

1. Is there evidence that women are underrepresented on S’s venires relative to
A to F’s?
2. Is there evidence of a difference in women’s representation on A to F’s
venires?
•The question of interest is addressed by 1
•The strength of the result in 1 would be substantially diminished if 2 is true


Step 1: Compare Judges A - F

Spock: The Strategy

Ho: All “other” means are equal (A, B, C, D, E, F)
Ha: At least 2 “other” means are different (A, B, C, D, E, F)
But … Let’s use all the data to estimate the pooled standard deviation!

Reduced Model: µs µo µo µo µo µo µo
Full Model: µs µA µB µC µD µE µF

Different Models in SAS
At Least 2 are different (S, A, B, … F)

Different Models in SAS

µs µA µB µC µD µE µF

At Least 2 are different (S, A, B, … F)

µs µA µB µC µD µE µF
Spock is different than the Others

µs µo µo µo µo µo µo
Spock is different than the Others

µs µo µo µo µo µo µo

Comparing Two Models:
Both are not Equal Means Model

At least 2 are different (Spock, A, B, C … F)
µs µA µB µC µD µE µF

SAS (proc glm) compares models to the equal means model. When you run proc glm,
it always makes the “Corrected Total Row” the equal means model. However, we can
build our own ANOVA table (BYOA) to compare two models, both of which are not
the equal means model.
To do this we will need to identify the “full” model and the “reduced” model. The
“full” model will be the model with the most parameters (means) in it while the
“reduced model” will have fewer parameters. (Note that the equal means model
(with one parameter) is the most reduced model you can have.)

Spock is different than others
µs µo µo µo µo µo µo

F-TEST: Another Look
Ho: µA, µB, µC …. µF are Equal
Ha: At least 2 are different (A,B,C …F)

Extra Sum of Squares
Test / BYOA

Reduced : µs µo µo µo µo µo µo
Source

DF

SS

MS

F

Full: µs µA µB µC µD µE µF

Pr > F

Model

Separate (Full Model)
Error
Means Model
Corrected Total
Equal Means
Model
(Reduced Model)

Full
Reduced

Source            DF   SS       MS      F      Pr > F
Model             5    326.5    65.29   1.37   0.26
Error             39   1864.4   47.81
Corrected Total   44   2190.9


EXTRA SUMS OF SQUARES F TEST

Step 1 Complete!

Ho: All means are equal (Spock,A,B,C…,F)

F-TEST

Ha: At least 2 are different (Spock,A,B,….F)

Ho: µA – µF are Equal
Ha: At least 2 are different (A,B, .. F)

There is not sufficient evidence to suggest that the mean percent of women on judge’s A-F
venires are different from one another (p-value = .26 from an ANOVA). Therefore, we will
now move on to Step 2 and compare Spock’s judge’s mean to the single mean that will
represent the other judges.
F-TEST: Another Look

Fail to Reject Ho

Ho: Spock is equal to Others
Ha: Spock is diff from Others

There is not sufficient evidence
at the alpha = .05 level of
significance (p-value = 0.26) to
suggest that the means are not
equal. Therefore, we will
proceed as if they are equal.

Ho: µA, µB, µC, …, µF are equal
Ha: At least 2 are different (A, B, C, …, F)

Source            DF   SS       MS      F      Pr > F
Model             5    326.5    65.29   1.37   0.26
Error             39   1864.4   47.81
Corrected Total   44   2190.9

Step 2!
Since we are proceeding under the assumption that the mean percentage of women
in venires of the non-Spock judges are equal, we can test whether the Spock judge has
a mean percentage different than the other judges by testing:

Ho: Mean of Spock is equal to the mean of the others.
Ha: Mean of Spock is different than the mean others.
There is strong evidence at the alpha = .05 level
of significance (p-value < .0001 from an ANOVA)
to support the claim that the mean percentage of
women in the Spock judge’s venires is less than
that of the other 6 judges and that there is no
evidence that the other 6 judges have different
mean percentages of women on their venires (pvalue = .26 from an Extra Sum of Squares F Test).
Spock’s lawyer has evidence for a mistrial.


Part VI

Multiple comparisons and post hoc tests


Chapter 25

Problem 1: Bonferroni and the Handicap
Study
The Bonferroni method was used to construct simultaneous confidence intervals for µ1 − µ2, µ2 − µ5, and
µ3 − µ5, to see whether there are differences in attitude toward the mobility-type handicaps. The Bonferroni CIs
were calculated using the following SAS code:
Code 25.1. Bonferroni in SAS
proc glm data = handicap;
class handicap;
model score = handicap;
means handicap / hovtest = bf bon cldiff;
lsmeans handicap / pdiff adjust = bon cl;
run;

Note that lsmeans and means give the same results here because we are dealing with balanced data. The result of this code is shown below:


Figure 25.0.1. Bonferroni Confidence Intervals

Another nice way to visualize these confidence intervals is with a diffogram:


Figure 25.0.2. Diffogram of the Bonferroni Confidence Intervals

As we see from these two figures, the only statistically significant mean difference was crutches vs. hearing; the
attitudes toward the different mobility handicaps appear to be the same (µ1 − µ2, µ2 − µ5, and µ3 − µ5 are not
significantly different from zero).
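For reference, here is a minimal sketch of how one such Bonferroni-adjusted interval could be computed "by hand" in R. It assumes the handicap data are available as case0601 (as in Code 26.2) with columns Score and Handicap; the pair of levels chosen below is only an illustration, not necessarily one of the three planned comparisons above.

# Bonferroni-adjusted simultaneous 95% CI for one planned difference: a sketch.
handi <- case0601                                   # assumes case0601 is loaded, as in Code 26.2
fit   <- aov(Score ~ Handicap, data = handi)

mse   <- deviance(fit) / df.residual(fit)           # pooled variance estimate (MSE)
n     <- table(handi$Handicap)                      # 14 per group (balanced design)
k     <- 3                                          # three planned comparisons
tcrit <- qt(1 - 0.05 / (2 * k), df.residual(fit))   # Bonferroni multiplier

means <- tapply(handi$Score, handi$Handicap, mean)
diff  <- means["Amputee"] - means["Hearing"]        # illustrative pair of groups
se    <- sqrt(mse * (1 / n["Amputee"] + 1 / n["Hearing"]))
diff + c(-1, 1) * tcrit * se                        # Bonferroni-adjusted interval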


Chapter 26

Multiple Comparison and the Handicap
Study
To generate all of the multiple comparisons and their half widths, the following SAS code was used:
Code 26.1. all the multiple comparisons in SAS
proc glm data = handicap;
class handicap;
model score = handicap;
means handicap / tukey bon scheffe LSD Dunnett('None');
run;

Here we see the results of this code:


(a) Bonferroni

(b) Tukey

(c) Dunnet

(d) Scheffe

(e) LSD

Figure 26.0.1. Half widths of different post hoc analyses in SAS


We did the same thing in R, with code and output shown below:


Code 26.2. Multiple comparisons with R
library(multcomp)   # provides glht() and mcp() used below
prob2 <- case0601
# we make none the first group so that dunnetts test behaves
prob2$Handicap <- factor(prob2$Handicap, levels=c('None', 'Amputee', 'Crutches', 'Hearing', 'Wheelchair'))
Handi <- prob2      # the models below are fit on this copy, matching the output shown
aovmodel <- aov(Score ~ Handicap, data=Handi)
# Now we can begin our tests
# Tukey's test
tukey <- glht(aovmodel, linfct=mcp(Handicap="Tukey"))
confint(tukey)   # Tukey

Simultaneous Confidence Intervals

Multiple Comparisons of Means: Tukey Contrasts

Fit: aov(formula = Score ~ Handicap, data = Handi)

Quantile = 2.8066
95% family-wise confidence level

Linear Hypotheses:
                            Estimate  lwr      upr
Amputee - None == 0         -0.4714   -2.2037   1.2608
Crutches - None == 0         1.0214   -0.7108   2.7537
Hearing - None == 0         -0.8500   -2.5822   0.8822
Wheelchair - None == 0       0.4429   -1.2894   2.1751
Crutches - Amputee == 0      1.4929   -0.2394   3.2251
Hearing - Amputee == 0      -0.3786   -2.1108   1.3537
Wheelchair - Amputee == 0    0.9143   -0.8179   2.6465
Hearing - Crutches == 0     -1.8714   -3.6037  -0.1392
Wheelchair - Crutches == 0  -0.5786   -2.3108   1.1537
Wheelchair - Hearing == 0    1.2929   -0.4394   3.0251

# Calculated by hand
# half width = 1.73225

# bonferroni ##
confint(tukey, test=adjusted(type="bonferroni"))   # bonferroni, we can just apply the bonferroni
                                                   # to whatever, according to the documentation

Simultaneous Confidence Intervals

Multiple Comparisons of Means: Tukey Contrasts

Fit: aov(formula = Score ~ Handicap, data = Handi)

Quantile = 2.8057
95% family-wise confidence level

Linear Hypotheses:
                            Estimate  lwr      upr
Amputee - None == 0         -0.4714   -2.2031   1.2602
Crutches - None == 0         1.0214   -0.7102   2.7531
Hearing - None == 0         -0.8500   -2.5817   0.8817
Wheelchair - None == 0       0.4429   -1.2888   2.1745
Crutches - Amputee == 0      1.4929   -0.2388   3.2245
Hearing - Amputee == 0      -0.3786   -2.1102   1.3531
Wheelchair - Amputee == 0    0.9143   -0.8174   2.6459
Hearing - Crutches == 0     -1.8714   -3.6031  -0.1398

Chapter 27

Comparing groups: Education study
27.1 Assumptions

Raw Data Analysis
First, we will look at the raw data. To check if the raw data fits the assumptions, we will first look at a scatter plot.
The scatter plot of the raw data was produced by the following bit of SAS code:
proc sgplot data=EduData;
scatter x=educ y=Income2005;
run;

This results in the following plot:
Figure 27.1.1. Scatter Plot of the Raw Data

Looking at Figure 27.1.1, we see that the raw data is very heavy in between 0 and 20,000 for all categories, but
some groups spread further and wider than others, which suggests the variances may not be equal. The heaviness

of the lower end of each group may also suggest a lack of normality. We will examine this further with some Box
plots. These were produced using the following chunk of SAS code:
proc sgplot data=EduData;
vbox Income2005 / category=educ
dataskin=matte
;
xaxis display=(noline noticks);
yaxis display=(noline noticks) grid;
run;

This results in the following plot:
Figure 27.1.2. Box Plot of the Raw Data

Figure 27.1.2 tells us a lot about our data. We see from the size and shape of the boxes that the variances of our
data are by no means homogeneous. Note that there are a lot of outliers while the distribution is heavily weighted
towards the bottom, which suggests our data may have departed from normality. We will examine this phenomenon
further using histograms. To produce histograms of the raw data, the following SAS code was used:
proc sgpanel data=EduData;
panelby educ / rows=5 layout=rowlattice;
histogram Income2005;
run;

This results in the following plot:


Figure 27.1.3. Histogram of the Raw Data

Figure 27.1.3 confirms our suspicions: the variances of the data are likely unequal and, more importantly, the
data is clearly skewed to the right. We will confirm this using Q-Q plots. To produce Q-Q plots of the raw data, the
following SAS code was used:
/* Normal = blom produces normal quantiles from the data */
/* To find out more, look at the SAS documentation!*/
proc rank data=EduData normal=blom out=EduQuant;
var Income2005;
/* Here we produce the normal quantiles!*/
ranks Edu_Quant;
run;
proc sgpanel data=EduQuant;
panelby educ;
scatter x=Edu_Quant y=Income2005 ;
colaxis label="Normal Quantiles";
run;

This results in the following plot:


Figure 27.1.4. Q-Q Plot of the Raw Data

The Q-Q plots in Figure 27.1.4 tell us what we already suspected: the raw data is not normal and does not have
equal variances. ANOVA is not robust to such highly skewed, long-tailed data, and it relies heavily on the
equal-variance assumption, so we cannot analyze the raw data directly.

Transformed Data Analysis
Now we will perform a log transformation on the data and see if that helps it meet our assumptions better. To do
a log transformation, we will employ the following SAS code:
data LogEduData;
set EduData;
LogIncome=log(Income2005);
run;

We will begin our analysis of the transformed data with a scatter plot, produced with the following SAS code:
proc sgplot data=LogEduData;
scatter x=educ y=LogIncome;
run;

This results in the following plot:


Figure 27.1.5. Scatter Plot of the Log-Transformed Data

As we can see in Figure 27.1.5, the groups have a much more similar spread, suggesting similar variances, and the
bulk of each group's points lies closer to the center, between the outliers, which suggests the log transformation
went a good way toward normalizing our data. We can examine this further using box plots. To produce
Box plots of the transformed data, the following SAS code was used:
proc sgplot data=LogEduData;
vbox LogIncome / category=educ
dataskin=matte
;
xaxis display=(noline noticks);
yaxis display=(noline noticks ) grid;
run;

This gives us the following plot:


Figure 27.1.6. Box Plot of the Log-Transformed Data

Figure 27.1.6 gives us some useful information about our data. We see the boxes and whiskers are of similar
size, which tells us the variances are likely homogeneous. Furthermore, the medians and means are near each
other, and the boxes are near the center of the distribution, which suggests that the data may be normal. We will
examine these two phenomena further with histograms. To produce histograms of the log-transformed data, the
following SAS code was used:
proc sgpanel data=LogEduData;
panelby educ / rows=5 layout=rowlattice;
histogram LogIncome;
run;

This results in the following plot:


Figure 27.1.7. Histogram of the Log-Transformed Data

From the spread of the histograms in Figure 27.1.7, we see two things. First, the similar width of the histograms
confirms that variances are roughly equal. Second, the shape of the histograms, and their location near the center
suggests that the data is very nearly normal. We will further examine the normality of the data using Q-Q plots. To
produce the Q-Q plots of the transformed data, the following SAS code was used:
proc rank data=LogEduData normal=blom out= LogEduQuant;
var LogIncome;
ranks LogEduQuant;
run;
proc sgpanel data=LogEduQuant;
panelby educ;
scatter x=LogEduQuant y=LogIncome ;
colaxis label="Normal Quantiles";
run;

This results in the following plot:


Figure 27.1.8. Q-Q Plot of the Log-Transformed Data

Examining the previous figure, we see a confirmation of our beliefs: The log-transformed data, when plotted
against normal quantiles, is fairly normal. This means, with the log transformed data, we can reasonably assume
normality and homogeneity of variances. We have fulfilled the assumptions of the ANOVA test and now we are
ready to go!


Chapter 28

Selection and Execution
First, we run an F test to see if any of the means are different.

28.1 ANOVA

We will now perform a complete analysis of our data, using Pure ANOVA.

Problem Statement
We would like to determine whether or not at least one of the five population distributions (corresponding to
different years of education) is different from the rest.

Assumptions
As seen in Section 27.1, the raw data does not meet the assumption of normality nor of homogeneity of variance.
However, in Section 27.1, we proved that after a log transformation, the data does meet both of these assumptions.
The ANOVA test is fairly robust to the slight departure from normality presented by the log transformed data, and
the variances are equal. The data is clearly independent, so that assumption is met. Therefore, all assumptions of
ANOVA are met by the log transformed data.

Hypothesis Definition
In this problem, our Null (Reduced Model) Hypothesis, H0 , is that all the groups have the same distribution and our
Alternative (Full Model) Hypothesis, H1 is that the distributions are different. Mathematically, that is written as:
H0 : median_grand   median_grand   median_grand   median_grand   median_grand   (Equal-Medians / Reduced Model)    (28.1.1)
H1 : median_<12   median_12   median_13−15   median_16   median_>16   (Separate-Medians / Full Model)    (28.1.2)

We will consider our significance level, α, to be 0.05.

F Statistic
To conduct this hypothesis test, the following SAS code was used:
proc glm data = LogEduData;
class educ;
model LogIncome = educ;
run;

This results in the following ANOVA Output:


Figure 28.1.1. ANOVA Table

Figure 28.1.1 tells us what our F statistic is. We see that

F = 62.87    (28.1.3)

P-value
Figure 28.1.1 also tells us our p-value. In this case,

p < .0001    (28.1.4)

Hypothesis Assessment
In this scenario, we have that p < .0001 < α = .05 and therefore we reject the null hypothesis.

Conclusion
There is substantial evidence (p < 0.0001) that at least one of the distributions is different from the others.

28.2 Tukey's Test

We want to compare all of the group means to see which ones differ, so we perform Tukey's test. We do this with the
SAS code and R code shown below. From the output, we see that aside from the college (16) and graduate school (>16)
educations, all of the groups are different from one another. A confidence interval for each difference, expressed as
the percent change in the median, is calculated by raising e to the endpoints of the confidence interval, subtracting
one, and multiplying by 100. These intervals are shown in Figure 28.2.1.


Code 28.1. Tukey's test in SAS and R
proc glm data = LogEduData;
class educ;
model LogIncome = educ;
lsmeans educ / pdiff = ALL adjust=tukey cl;
run;

and the following R code (and output)
# glht() and mcp() come from the multcomp package (library(multcomp)), as in Code 26.2
edudata <- read.csv(file='c:/Users/david/Desktop/MSDS/MSDS6371/Homework/Week6/Data/ex0525.csv', header=TRUE, sep=",")
edudata$logincome <- log(edudata$Income2005)
prob3 <- edudata
aovmodel2 <- aov(logincome ~ Educ, data=prob3)
tukkey <- glht(aovmodel2, linfct=mcp(Educ="Tukey"))
summary(tukkey)

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: aov(formula = logincome ~ Educ, data = prob3)

Linear Hypotheses:
                   Estimate Std. Error t value Pr(>|t|)
<12 - <<12 == 0    -0.32787    0.08493  -3.861  0.00101 **
>16 - <<12 == 0     0.67069    0.05624  11.926  < 0.001 ***
13-15 - <<12 == 0   0.16400    0.04674   3.509  0.00389 **
16 - <<12 == 0      0.56987    0.05459  10.439  < 0.001 ***
>16 - <12 == 0      0.99856    0.09316  10.719  < 0.001 ***
13-15 - <12 == 0    0.49187    0.08775   5.606  < 0.001 ***
16 - <12 == 0       0.89775    0.09217   9.740  < 0.001 ***
13-15 - >16 == 0   -0.50669    0.06041  -8.387  < 0.001 ***
16 - >16 == 0      -0.10082    0.06668  -1.512  0.54057
16 - 13-15 == 0     0.40588    0.05888   6.893  < 0.001 ***
---

Figure 28.2.1. Tukey CIs on percent increase in the median
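As a worked example of this back-transformation, take the 16 vs. 13-15 comparison, whose estimate in the R output above is 0.40588 on the log scale: (e^0.40588 − 1) × 100 ≈ (1.50 − 1) × 100 ≈ 50%, so the median income of the 16-year group is estimated to be roughly 50% higher than that of the 13-15 group. The interval endpoints shown in Figure 28.2.1 are obtained by applying the same transformation to the confidence limits.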


Dunnett’s Test
To compare to a control, dunnets test is the best! We do this with the following SAS code: lets look at the SAS
Code 28.2. DUnnett’s test
proc glm data = LogEduData;
class educ;
model LogIncome = educ;
lsmeans LogIncome / pdiff = ALL adjust=dunnett cl;
run;

and the following R code (and output!).
# (assumed) Dunnett contrasts against the first Educ level; the original document does not
# show how dunnbett was created, so this construction is an assumption.
dunnbett <- glht(aovmodel2, linfct=mcp(Educ="Dunnett"))
summary(dunnbett)   # Dunnett

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Dunnett Contrasts

Fit: aov(formula = logincome ~ Educ, data = prob3)

Linear Hypotheses:
                   Estimate Std. Error t value Pr(>|t|)
<12 - <<12 == 0    -0.32787    0.08493  -3.861 0.000461 ***
>16 - <<12 == 0     0.67069    0.05624  11.926  < 1e-04 ***
13-15 - <<12 == 0   0.16400    0.04674   3.509 0.001818 **
16 - <<12 == 0      0.56987    0.05459  10.439  < 1e-04 ***
---

Let's look at the SAS output too:
Figure 28.2.2. SAS p values

We see that all of the groups are different from the control. We can calculate confidence intervals for the percent
difference by raising e to the endpoints of each CI, subtracting one, and multiplying by 100, as seen in
the next figure.


Figure 28.2.3. Dunnett CIs on percent increase in the median


Chapter 29

Unit 6 Lecture Slides
lol


Overview
UNIT 6 Live Session
Contrasts
Multiple Comparison

• ANOVA provides an F-test for equality of
several means
• The main weaknesses are

• It doesn’t tell us which means are different
• It doesn’t account for any structure in the groups

(Example: Is the average treatment effect across 3
levels of treatments different from the placebo?)

• The downside to this more refined analysis is
that we need to control for the number of
comparisons we end up making

Example:
Handicap & Capability Study

Example:
Handicap & Capability Study

• Goal: How do physical handicaps affect perception of
employment qualification?

• Do subjects systematically evaluate qualifications
differently according to handicap?
• If so, which handicaps are evaluated differently?

• (Cesare, Tannenbaum, and Dalessio, "Interviewers' decisions related to applicant handicap type and rater empathy" (1990), Human Performance)

• The researchers prepared 5 video taped job interviews
with same actors
• The tapes differed only in the handicap of the applicant:
• No handicap (This is the control group)
• One leg amputated
• Crutches
• Hearing Impaired
• Wheelchair

• 14 students were randomly assigned to each tape to rate
applicants: 0-10 pts (70 students total.)


Is There Any Difference at All?
• We should begin any analysis involving several
groups by using the ANOVA framework
• If there isn’t any (statistically) significant
difference in the population means, then there is
no reason to address more refined questions
• The tapes differed only in the handicap of the
applicant:
• No handicap (This is the control group.)
• One leg amputated
• Crutches
• Hearing Impaired
• Wheelchair

Handicap & Capability Study:
Equal Variances Assumption

Handicap & Capability Study:
Normality Assumption

There is NO visual evidence to suggest that the data are
not normally distributed. We will proceed with the
assumption of normally distributed groups.

Handicap & Capability Study: ANOVA results

There is evidence to support the claim that at least two population means are different from each other (p-value of 0.0301 from a 1-way ANOVA). There is NO evidence to suggest variances are unequal; notice that since there is virtually no evidence of a difference in standard deviations, Welch’s test is almost identical to the pure F ANOVA.
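That comparison takes two lines in R; a sketch using the same assumed handi data frame:

summary(aov(Score ~ Handicap, data = handi))   # standard one-way ANOVA (pooled-variance F test)
oneway.test(Score ~ Handicap, data = handi)    # Welch's ANOVA, which does not assume equal variances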


Handicap & Capability Study: More Specific Questions

Linear Combinations & Contrasts
(The linear-combination and standard-error formulas require independence of the group means; a CONTRAST is the special case where the coefficients sum to zero.)

Handicap & Capability Study: A Contrast
Calculate the mean difference and its standard error.
There is evidence that the sum of points assigned to the Amp & Hear handicaps is smaller than the sum of points assigned to the Crutch & Wheel handicaps at level alpha equal to 0.05, because the CI does not contain 0.
CI: point estimate ± multiplier * standard error
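A sketch of the arithmetic behind that interval, using the assumed handi data frame from above (the factor level names and the 99% multiplier, which matches the 2.65 on the next slide, are assumptions):

handi_fit <- aov(Score ~ Handicap, data = handi)
ybar <- tapply(handi$Score, handi$Handicap, mean)               # group means
n    <- tapply(handi$Score, handi$Handicap, length)             # group sample sizes
Cvec <- c(Amputee = 0.5, Crutches = -0.5, Hearing = 0.5,        # (Amp & Hear)/2 minus (Crutch & Wheel)/2
          None = 0, Wheelchair = -0.5)
mse  <- sum(residuals(handi_fit)^2) / handi_fit$df.residual     # pooled variance (MSE)
g    <- sum(Cvec * ybar[names(Cvec)])                           # contrast point estimate
se_g <- sqrt(mse * sum(Cvec^2 / n[names(Cvec)]))                # standard error of the contrast
g + c(-1, 1) * qt(0.995, handi_fit$df.residual) * se_g          # 99% CI: estimate ± multiplier * SE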


Chapter 6: Compare with book!

Handicap & Capability Study: In SAS
• Order = data keeps the data in the order it came in, so that the “none” group is first and can be assigned a coefficient of 0.
• Note the sign switch and division by 2 of the coefficients. The divisor option comes in handy when doing the division by hand would result in the need to input a rounded number (for example, 0.33).

Handicap & Capability Study: In SAS (Confidence Intervals)

There is evidence that the average points assigned to the Amp & Hear handicaps is smaller than the average points assigned to the Crutch & Wheel handicaps (t-tools linear contrast p-value of 0.0022). We estimate that this difference is -1.39 pts, with an associated 99% confidence interval of:
99% CI for the difference in averages of Amp and Hear vs. Crutch and Wheel:
point estimate ± multiplier * standard error
-1.39 ± 2.65 * 0.436
-1.39 ± 1.155
(-2.55, -0.23), which of course does not include 0.
There are three different ways (contrast, estimate, estimate with divisor=2) to test for the same idea. (There are many more than three!)


Chapter 6

Let’s Try Some from Spock Example!!
Groups: A, B, C, D, E, F, S
(With no Order = data in the code, the contrasts are assigned in alphabetical order, so that the “none” group is fourth.)
Contrast vector (assume alphabetical order): answer on next slide ->
Answer: -1 -1 -1 -1 -1 -1 6

Let’s Try ANOTHER (from Spock)!!
Groups: A, B, C, D, E, F, S
Contrast vector (assume alphabetical order): answer on next slide ->


Let’s Try ANOTHER (from Spock)!!
Groups: A, B, C, D, E, F, S
Answer: contrast vector (assume alphabetical order): 1 1 1 -1 -1 -1 0
ADDITIONAL QUESTION: Why is it better to include the Spock data in the calculation of the pooled SD (and thus the MSE) even though the hypothesis does not include it?

Let’s Try ONE MORE (from Spock)!!
Groups: A, B, C, D, E, F, S
Contrast vector (assume alphabetical order): answer on next slide ->
Answer: 3 0 3 -2 -2 -2 0

Multiple Comparison: Motivation
K tests
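Any of these contrast vectors can be tested in R by handing the coefficients to multcomp; a minimal sketch (spock_fit and the factor name Judge are placeholders for a one-way fit on the seven groups in alphabetical order A-F, S):

library(multcomp)
K <- rbind("Spock judge vs the rest" = c(-1, -1, -1, -1, -1, -1, 6))  # coefficients in level order A..F, S
spock_contrast <- glht(spock_fit, linfct = mcp(Judge = K))            # build the single-contrast test
summary(spock_contrast)                                               # t statistic and p-value
confint(spock_contrast)                                               # confidence interval for the contrast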


Multiple Comparison: Example k = 100
(Diagram: one confidence interval for each of Gene 1 through Gene 100.)
When we make a correction for multiple comparisons, it is the critical value in the
hypothesis test and thus the multiplier in the confidence interval that is adjusted.
*The multiplier is usually the same as the critical value for a hypothesis test.

Planned & Post-hoc Tests
A planned test is one in which you know the comparisons (tests) you
want to make before you look at the data.
If you have k planned comparisons then you need to correct for just
those k comparisons.

Post-Hoc / Unplanned Tests
Post Hoc tests are appropriate when:
1. The researcher wants to examine all
possible comparisons among pairs of group
means (or a large number of comparisons).
2. Predictions about which groups will differ
are not made prior to setting up the
analysis.


Multiple Comparison: Bonferroni
Multiplier = the t critical value at 1 - (α/k)/2 with the error degrees of freedom.
For a set of k Bonferroni-adjusted t-tests (each run at α/k), we must have normal distributions, equal spreads, and independence (same as typical t-tests). However, the Bonferroni correction can be extended to tests that have no assumptions about distributions (e.g. the rank sum test). For any set of independent parametric or non-parametric tests, the Bonferroni correction works the same.
This approach is very conservative, meaning that the intervals are wider than necessary for the nominal coverage level, particularly if the tests are not really independent.

Multiple Comparison: Tukey-Kramer
Assumes normal distributions, equal spreads, independence (same as typical t-tests), and equal group sample sizes. The Tukey-Kramer adjustment is a modification to this test to account for different sample sizes in the groups.
More consistent than Bonferroni with respect to Type I error, but not robust to its assumptions…. Bonferroni is a good alternative when the assumptions are violated.
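Both corrections are one-liners in R; a sketch, where the vector of raw p-values, the 65 degrees of freedom, and the handi data frame are all stand-ins for the handicap example:

raw_p <- c(0.0035, 0.0478, 0.2100)              # a few raw pairwise p-values (placeholders except .0035)
p.adjust(raw_p, method = "bonferroni", n = 10)  # Bonferroni: multiply by the 10 tests, cap at 1
qt(1 - 0.05 / (2 * 10), df = 65)                # Bonferroni multiplier for 10 two-sided 95% CIs (65 error df)
TukeyHSD(aov(Score ~ Handicap, data = handi))   # Tukey-Kramer adjusted pairwise CIs and p-values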

Multiple Comparison: Dunnett (Many Groups to One Control)
Assumes normal distributions, equal spreads, and independence (same as typical t-tests). Replaces the t-distribution with a multivariate t-distribution (n = # of groups versus the control), where the tests are not independent.

Studentized Range Statistic Table
(Table of studentized range critical values shown in the slides.)

Handicap / Capability Study: Data
(Data table shown in the slides.)


Handicap Data Analysis

First Test!!!

Questions of Interest:
1. Is there any evidence that at least one pair of mean
qualification scores are different from each other?
2. Let’s say we are only interested in Amputee versus None.
Test the claim the Amputee has a different mean score than
the None group.
3. Now let’s assume that we are interested in identifying
specific differences between any two of the group means.
Find evidence of any differences in the means between the
groups.
4. Next, assume that we were interested in testing the means
of the handicapped groups to the non-handicap group. Test
this claim and identify any significant differences.

Normality: Handicap Data

There is no visual evidence to suggest that the data are not
normally distributed. We will proceed with the assumption of
normally distributed groups.

Homogeneity of SD Assumption

There is no evidence to suggest variances are unequal.
Independence may be violated here. We are going to proceed anyway for
the sake of the example.


First QOI!!!
1. Is there any evidence that at least one pair of mean qualification scores are different from each other?
There is sufficient evidence to suggest at the alpha = .05 level of significance (p-value = .0301) that at least 2 of the means are different from each other in this standard ANOVA.

Second QOI!!!
2. Let’s say we are only interested in Amputee versus None. Test the claim that the Amputee group has a different mean score than the None group.
The results of these tests are equivalent! There is not sufficient evidence to suggest that the mean qualification rating of the amputee group is different than the group without handicap. (P-value = .4678 from a t-test and an ANOVA using only these two groups.)

Second QOI: Better approach!!!
There is not sufficient evidence to suggest that the mean qualification rating of the amputee group is different than the group with no handicap (p-value = .4477 from a contrast using all available data). Even though the p-values for the two tests are only slightly different, it is better to use all available data (the procedure on the right). Comparing a pair of means can be just a simple contrast.

Third QOI!!!

Now let’s assume that we are interested
in identifying specific differences
between any two group means. Find
evidence of any differences in the means
between the groups.

There are 10 different two sided tests conducted
here; thus, we need to adjust alpha per test to be
.05/10 = .005. With this adjustment, only one of the
tests has a statistically significant result. Therefore,
there is evidence (p-value = .0035 from a t-test) that
the crutches and hearing groups have different mean
qualification rating scores. We will provide a
confidence interval in a few slides.


Third QOI!!!
Now let’s assume that we are interested in identifying specific differences between any two group means. Find evidence of any differences in the means between the groups.

Bonferroni Adjusted P-Values
• P-values not adjusted: compare each to the individual (per-test) alpha = 0.005.
• P-values adjusted (each raw p-value multiplied by 10, capped at 1): compare each to the family-wise alpha = 0.05.
A 95% confidence interval for the difference in means of the crutches and hearing groups is (.0779, 3.66499).
*Slightly different code from the last slide, producing slightly different output. Note the cl versus cldiff.

4th QOI: Next, assume that we are interested in testing the means of the handicapped groups against the non-handicapped group. Test this claim and identify any significant differences. (Using CIs)
There is NOT sufficient evidence in this study to suggest that there are any differences between the mean of each handicap group and the mean of the group without handicap. The 95% family-wise confidence intervals are constructed using Dunnett’s procedure. All CIs contain zero, thus not providing sufficient evidence to conclude that any difference is not zero. (The study results do not constitute sufficient evidence to support the claim that any means tested are individually different than the control.)
(Code note: specify the control group in the Dunnett statement.)


4th QOI: Next, assume that we were interested in testing the means of the handicapped groups against the non-handicap group. Test this claim and identify any significant differences. (Using HTs)
Hypothesis tests also conclude that there is not sufficient evidence to suggest that there are any differences between the mean of each handicapped group and the mean of the group without handicap. The Dunnett-adjusted p-values are all greater than alpha = .05, as is visible from the table above.

R Code for Handicap Example Question 1
Question 1: Reading in Data and ANOVA

R Code for Handicap Example Question 2
R Code for Handicap Example Question 3
(Note: these examples require the multcomp and pairwiseCI packages.)
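The slides show these programs as screenshots; a minimal sketch of what the R code might look like (the file name, data frame, and column names are assumptions):

library(multcomp)                                        # glht()/mcp() for the adjusted comparisons
# library(pairwiseCI)                                    # the slides also load pairwiseCI for pairwise CIs

handi <- read.csv("handicap.csv")                        # Question 1: read in the data
handi_fit <- aov(Score ~ Handicap, data = handi)         # ...and run the one-way ANOVA
summary(handi_fit)

t.test(Score ~ Handicap,                                 # Question 2: Amputee vs. None only
       data = droplevels(subset(handi, Handicap %in% c("Amputee", "None"))))

pairwise.t.test(handi$Score, handi$Handicap,             # Question 3: all pairwise comparisons with
                p.adjust.method = "bonferroni")          # Bonferroni-adjusted p-values

summary(glht(handi_fit,                                  # Question 4: each handicap group vs. the control
             linfct = mcp(Handicap = "Dunnett")))        # (relevel() so "None" is the first, control, level)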


R Code for Handicap Example Question 4
(Note: Must load the multcomp package.)

Appendix

Bonferroni’s Correction


Multivariate distribution
• A multivariate distribution is the distribution of a vector of jointly distributed (possibly correlated) random variables.
• A bivariate normal distribution can easily be shown graphically.


Part VII

Workflow for testing hypotheses


CHOOSING A HYPOTHESIS TEST
(One-page flowchart by Michael Burkhardt, Rev. 5, 6/25/2015, mburkhardt@smu.edu, reproduced here as an outline. The chart works left to right through RESEARCH STRUCTURE, NORMAL DISTRIBUTION, SAMPLE SIZE, VARIANCE, DATA TRANSFORMATION, and MULTIPLE HYPOTHESIS TEST.)

ONE SAMPLE (difference between the mean of an independent sample and a hypothesized mean; single measure or observation) and MATCHED PAIRS (difference within the same group before and after treatment; repeated measures or observations):
• No evidence against normality, or a sufficient sample size (CLT), or a log transformation* -> ONE-SAMPLE T-TEST (parametric; inference on means, medians if log-transformed)
• Evidence against normality and an insufficient sample size -> SIGN TEST or WILCOXON SIGNED RANK TEST (nonparametric; inference on medians)

UNPAIRED TESTING, TWO SAMPLES (difference between independent groups, between-groups; single measure or observation):
• No evidence against normality, or a sufficient sample size (CLT), or a log transformation*:
  • No evidence against equal standard deviations -> POOLED TWO-SAMPLE T-TEST (parametric; inference on means)
  • Evidence against equal standard deviations -> WELCH’S T-TEST (parametric; inference on means)
  (The chart also asks whether the two groups have the same sample sizes when choosing between the pooled and Welch’s t.)
• Evidence against normality and an insufficient sample size -> WILCOXON RANK SUM, aka Mann-Whitney U test (nonparametric; inference on medians)

UNPAIRED TESTING, MORE THAN TWO SAMPLES (difference between independent groups, between-groups; single measure or observation):
• No evidence against normality, or a sufficient sample size (CLT), or a log transformation*:
  • No evidence against equal standard deviations -> ONE-WAY ANOVA (parametric; inference on means, medians if log-transformed)
  • Evidence against equal standard deviations -> WELCH’S ANOVA (parametric; inference on means)
• Evidence against normality and an insufficient sample size -> KRUSKAL-WALLIS (nonparametric; inference on medians)

POST HOC TESTS (multiple hypothesis tests):
• TUKEY-KRAMER (aka TUKEY’S HSD)
• DUNNETT, for comparison to a control group
• BONFERRONI CORRECTION: distribution-free, more conservative, wider intervals
• REGWQ: lower Type II error rate than either Bonferroni or Tukey-Kramer

* Tests using log-transformed data give inference on medians.

HYPOTHESIS TESTING STEP-BY-STEP
1. Read the problem carefully. Is it a randomized experiment or an observational study?
2. Plot the data using histograms, box plots, or QQ plots.
3. Determine which test to use. Do the data satisfy the test’s assumptions?
4. State the null and alternative hypotheses. Is this a one-sided or two-sided test?
5. Select a test statistic and confidence level (1-α). Find the critical value.
6. Sketch the distribution, including the critical value and the acceptance and/or rejection region(s).
7. Compute the test statistic and the probability (p-value) of obtaining the observed results if the null hypothesis is true.
8. Reject or fail to reject the null hypothesis. (Never accept the null hypothesis.)
9. Perform post hoc testing, if applicable, to determine which groups are different.
10. State the statistical conclusion in the context of the original problem.


Note that the nonparametric tests give inference on medians; Kruskal-Wallis is the nonparametric counterpart of one-way ANOVA.
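As a quick reference, here is a sketch of the base-R call for each test in the chart (d1, d2, and dk are placeholder data frames for the one-sample/paired, two-group, and k-group settings, and mu = 5 is a placeholder hypothesized mean):

t.test(d1$y, mu = 5)                            # one-sample t-test
wilcox.test(d1$y, mu = 5)                       # Wilcoxon signed rank test
t.test(d1$after, d1$before, paired = TRUE)      # matched-pairs t-test
t.test(y ~ group, data = d2, var.equal = TRUE)  # pooled two-sample t-test (two groups)
t.test(y ~ group, data = d2)                    # Welch's t-test (the R default)
wilcox.test(y ~ group, data = d2)               # Wilcoxon rank sum (Mann-Whitney U)
summary(aov(y ~ group, data = dk))              # one-way ANOVA (k groups)
oneway.test(y ~ group, data = dk)               # Welch's ANOVA
kruskal.test(y ~ group, data = dk)              # Kruskal-Wallis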



