MSDS 6371-405 Analysis Guide
David Josephs
October 13, 2018

Contents

Part I: Drawing Statistical Conclusions
  Chapter 1  Problem 1: Randomized Experiment vs Random Sample
  Chapter 2  Problem 2: Identifying Confounding Variables
  Chapter 3  Problem 3: Identifying a Scope of Inference
  Chapter 4  Problem 4: Visual Comparison of Population Means and a Permutation Test
  Chapter 5  Unit 1 Lecture Slides

Part II: Inferences Using the t-distribution
  Chapter 6  Problem 1: A One Sample t Test
  Chapter 7  Problem 2: Two Sample One Sided t Test
  Chapter 8  Problem 3: Two Sample Two Sided t Test
  Chapter 9  Problem 4: Power
  Chapter 10 Unit 2 Lecture Slides

Part III: A Closer Look at Assumptions
  Chapter 11 Problem 1: Two Sample t Test with Assumptions
  Chapter 12 Outliers and Logarithmic Transformations
  Chapter 13 Log Transformed Data
  Chapter 14 Unit 3 Lecture Slides

Part IV: Alternatives to the t Tools
  Chapter 15 Problem 2: Logging Problem
  Chapter 16 Problem 3: Welch's Two Sample T-Test with Education Data
  Chapter 17 Problem 4: Trauma and Metabolic Expenditure Rank Sum
  Chapter 18 Problem 5: Autism and Yoga Signed Rank
  Chapter 19 Sexy Ranked Permutation Test
  Chapter 20 Unit 4 Lecture Slides

Part V: ANOVA
  Chapter 21 Problem 1: Plots and Logged Data
  Chapter 22 Extra Sum of Squares
  Chapter 23 Welch's ANOVA
  Chapter 24 Unit 5 Lecture Slides

Part VI: Multiple Comparisons and Post Hoc Tests
  Chapter 25 Bonferroni CIs
  Chapter 26 Multiple Comparison
  Chapter 27 Tukey's Test and Dunnett's Test
  Chapter 28 Multiple Samples
  Chapter 29 Unit 6 Lecture Slides

Part VII: Workflow for Testing Hypotheses
Part I
Drawing Statistical Conclusions

Chapter 1
Problem 1: Randomized Experiment vs Random Sample

Question 1
What is the difference between a randomized experiment and a random sample? Under what type of study/sample can a causal inference be made?

Answer to Question 1
In a randomized experiment, the experimental variable ("treatment") is assigned to the subjects at random. For example, in a study with 400 subjects and treatments A, B, and a control group, each subject would be randomly assigned to the control group, group A, or group B. This is done to eliminate confounding variables, as well as possible bias. In a random sample, subjects are randomly chosen from the population. This is done so that the subjects of the study can be assumed to be representative of the population as a whole [1]. We can make causal inferences from a randomized experiment, but not from a random sample.
Score: 20/20. Explanation: This answer gets full marks because it covers all of the points made in the key and defines both random sampling and randomization in the same manner as the key; in the future, however, it should be less wordy.

Chapter 2
Problem 2: Identifying Confounding Variables

Question 2
In 1936, the Literary Digest polled 1 out of every 4 Americans and concluded that Alfred Landon would win the presidential election in a landon-slide. Of course, history turned out dramatically different (see http://historymatters.gmu.edu/d/5168/ for further details). The magazine combined three sampling sources: subscribers to its magazine, phone number records, and automobile registration records. Comment on the desired population of interest of the survey and what population the magazine actually drew from.

Answer to Question 2
The magazine had hoped to get a random sample, a cross-section of the voting population that would be representative of the entire voting population of the country as a whole. Instead, it only polled its own subscribers, phone number records, and automobile registration records. 1936 was at the height of the Great Depression, which means that the average American was struggling to survive. Therefore, while this sampling technique had worked in the past, this time around the magazine ended up sampling only the wealthiest people, those who could afford phones, cars, and magazine subscriptions, and the results were not representative of the population. Without truly random sampling, "the statistical results only apply to [those] sampled" and cannot be representative of the entire population [2]. Therefore, it is just chance that the polls worked in previous years.
Score: 10/10. Explanation: This answer gets full marks because it states that the poll wanted to cover all of the voters (5 points), and it identifies the actual group polled with some explanation (affluent people) (5 points).

Chapter 3
Problem 3: Identifying a Scope of Inference

Question 3
Suppose we have developed a new fertilizer that is supposed to help corn yields. This fertilizer is so potent that a small vial of it sprayed over an entire field is a sufficient dose. We find that the new fertilizer results in an average yield of 60 more bushels over the old fertilizer with a p-value of 0.0001. Write up a scope of inference under the following study designs that generated this data.
1.
We offer the new fertilizer at a discount to customers who have purchased the old fertilizer along with a survey for them to fill out. Some farmers send in the survey after the growing season, reporting their crop yield. From our records, we know which of these farmers used the new fertilizer and which used the old one. 2. When a customer makes an order, we randomly send them either the old or new fertilizer. At the end of the season, some of the farmers send us a report of their yield. Again, from our records, we know which of these farmers used the new fertilizer and which used the old. 3. When a customer makes an order, we randomly send them either the old or new fertilizer. At the end of the season, we sub-select from the fertilizer orders and send a team out to count those farmers’ crop yields. 4. We offer the new fertilizer at a discount to customers who have purchased the old fertilizer. At the end of the season, we sub-select from the fertilizer orders and send a team out to count those farmers’ crop yields. From our records, we know which of these farmers used the new fertilizer and which used the old one. Answer 1. We cannot make causal inferences or inferences about the population, as it was not randomized or a random sample. Available units from distinct groups were selected, however the treatment was not assigned randomly, which may mean only farmers who needed a change in fertilizer or were struggling and could not afford the old fertilizer decided to go for the discount, and then the study is also only representative of those who submitted reports, as no random sampling was done Score: 8/8. Explanation: This answer gets full credit because it states that causal inferences cannot be made and that population inferences cannot be made, which agrees with the key 2. We can make causal inferences but not inferences about the population. The treatment was applied at random to the subjects, but no random sampling was done. Therefore this study only speaks to the effect of the treatment on farmers who submitted reports, which may mean that they had noteably different yields. Score: 8/8. Explanation: This answer receives full credit because it states that causal inferences can be made, and that population statements cannot be made, with explanations, all agreeing with the key 3. We can make causal inferences and inferences about the population. The farmers were randomly assigned different treatments, which allows us to make causal inferences, and then the farmers were randomly selected for the yield to be counted, which means that the selected farmers should be representative of the entire population. With these experimental parameters, we can decide whether the new fertilizer worked better, worse, or the same. Score: 7/8. Explanation: This answer loses a point because the problem does not explicitly state that the sub sample was random. I assumed it was a random sample, and with that assumption, the answer is entirely correct, however the randomness is not explicitly stated. Therefore a point is taken away. The rest of the answer agrees entirely with the key, therefore no more points will be lost 4. We can make inferences about the population but not causal inferences. The treatment was not supplied randomly, so maybe only farmers who needed a discount or the old fertilizer wasnt working for 10 Analysis Guide Midterm chose the new fertilizer. 
However, they were randomly sampled, which means we can make inferences about the population to some degree but we definitely cannot make causaul inferences. Score: 7/8. Explanation: This answer loses a point because the problem does not explicitly state that the sub sample was random. I assumed it was a random sample, and with that assumption, the answer is entirely correct, however the randomness is not explicitly stated. Therefore a point is taken away. The rest of the answer agrees entirely with the key, therefore no more points will be lost. 11 Chapter 4 Problem 4: Visual comparison of population means and a permutation test Question 4 4. A Business Stats class here at SMU was polled, and students were asked how much money (cash) they had in their pockets at that very moment. The idea was to see if there was evidence that those in charge of the vending machines should include the expensive bill / coin acceptor or if the machines should just have the credit card reader. Also, a professor from Seattle University polled her class last year with the same question. Below are the results of the polls. SMU 34, 1200, 23, 50, 60, 50, 0, 0, 30, 89, 0, 300, 400, 20, 10, 0 Seattle U 20, 10, 5, 0, 30, 50, 0, 100, 110, 0, 40, 10, 3, 0 1. Use SAS to make a histogram of the amount of money in a student’s pocket from each school. Does it appear there is any difference in population means? What evidence do you have? Discuss your thoughts. 2. Use the following R code to reproduce your histograms. Simply cut and paste the histograms into your HW. SMU = c(34, 1200, 23, 50, 60, 50, 0, 0, 30, 89, 0, 300, 400, 20, 10, 0) Seattle = c(20, 10, 5, 0, 30, 50, 0, 100, 110, 0, 40, 10, 3, 0) hist(SMU) hist(Seattle) 3. Run a permutation test to test if the mean amount of pocket cash from students at SMU is different than that of students from Seattle University. Write up a statistical conclusion and scope of inference (similar to the one from the PowerPoint). (This should include identifying the Ho and Ha as well as the p-value.) Answer 1. Code (see Appendix 1) for the SAS histogram (Figure 1) was inspired by [3]. The code used to produce this histogram is as follows: Code 4.1. Creating Paneled histograms in SAS proc sgpanel data=CashMoney; panelby School / rows=2 layout=rowlattice; histogram cash / binwidth = 25; run; 12 Analysis Guide Midterm Figure 4.0.1. Distribution of Cash by School, produced in SAS It appears that for the sample means, the SMU sample has a slighly higher mean, however I do not believe that means that the population of SMU has a higher mean than Seattle U, as this was not a random sample, it was just of business students. It appears that the SMU cash distribution is wider, with higher values, but again it is hard to tell if it is indicative of the entire population, I believe, based off of where the majority of the distributions lie, both populations would have similar means, with SMU having a slightly higher mean. SMU is a private school and Seattle U is one of the best value schools in the country, so it is possible that SMU students might have in general, more money than students at Seattle U, and therefore more cash. Score: 5/5. Explanation: This receives full marks, the histograms are correct and the conclusions are similar to the key, and are very logical. The code is included in the appendix. 2. The code used to generate the R histograms (Figure 2) was given in the homework and is presented below Code 4.2. 
Producing histograms in R
SMU = c(34, 1200, 23, 50, 60, 50, 0, 0, 30, 89, 0, 300, 400, 20, 10, 0)
Seattle = c(20, 10, 5, 0, 30, 50, 0, 100, 110, 0, 40, 10, 3, 0)
par(mfrow=c(1,2))
hist(SMU)
hist(Seattle)

Figure 4.0.2. Cash Distributions at SMU and Seattle U, Produced using R

The code used to generate the permutation test (Appendix 2), using SAS, is given in [4]. The results of the permutation test, run with 999,999 permutations, can be seen in Figure 4.0.3. The SAS code appears in Code 4.3 and the corresponding R code in Code 4.4 below.

Code 4.3. Two Tailed permutation test in SAS, using manually input groups
proc iml;
G1 = {/*SMU student data*/};
G2 = {/*Seattle U student data*/};
obsdiff = mean(G1) - mean(G2); /* difference in the means of the two data sets */
print obsdiff;
call randseed(12345); /* set random number seed */
alldata = G1 // G2; /* stack data in a single vector */
N1 = nrow(G1);
N = N1 + nrow(G2);
NRepl = 999999; /* number of permutations (~1 million) */
nulldist = j(NRepl,1); /* allocate vector to hold results */
do k = 1 to NRepl;
   x = sample(alldata, N, "WOR"); /* permute the data */
   nulldist[k] = mean(x[1:N1]) - mean(x[(N1+1):N]); /* difference of means */
end;
title "Histogram of Null Distribution";
refline = "refline " + char(obsdiff) + " / axis=x lineattrs=(color=red);"; /* build the refline statement */
call Histogram(nulldist) other=refline;
pval = (1 + sum(abs(nulldist) >= abs(obsdiff))) / (NRepl+1); /* calculate the two-sided p-value */
print pval;
/* https://blogs.sas.com/content/iml/2014/11/21/resampling-in-sas.html */

Figure 4.0.3. Results of Permutation Tests

In this test, the null hypothesis is that there is no difference between the mean amount of cash in a student's pocket in the two groups, while the alternative hypothesis is that there is a meaningful difference between the two [4]. The permutations were used to generate the null distribution of differences, and the red line shows where the experimental difference lies. Further calculation shows that the p-value was 0.149, meaning that about 15% of the permuted differences were at least as extreme as the observed difference [5]. At a 5% or even 10% significance level, we cannot reject the null hypothesis, and therefore we cannot say there is any difference between the two means. The SMU students and Seattle U students have more or less the same amount of cash in their pockets; the observed difference is not statistically significant. As for scope of inference, this was not a randomized experiment or a random sample, and therefore we cannot make any causal inferences (there was no treatment applied, and we definitely cannot say going to SMU makes you have more or less money in your pocket than going to Seattle U), and we cannot make any inferences about the student bodies as a whole (population inferences). The sample is only representative of the students sampled, so we have very little scope of inference.
Score: 15/15. Explanation: This receives full marks, 5 points for running the test, 5 points for the p value, and 5 points for mentioning the null and alternative hypotheses and getting the correct conclusion. The code is included in the Appendix.
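For comparison with the SAS IML loop above and with the homework's loop-based R version below (Code 4.4), the same two-sided permutation p-value can be written more compactly in R with replicate(). This is only a sketch; the cash vectors are repeated so it runs on its own, and the permutation count is arbitrary:

SMU <- c(34, 1200, 23, 50, 60, 50, 0, 0, 30, 89, 0, 300, 400, 20, 10, 0)
Seattle <- c(20, 10, 5, 0, 30, 50, 0, 100, 110, 0, 40, 10, 3, 0)
obs  <- mean(SMU) - mean(Seattle)            # observed difference in means
pool <- c(SMU, Seattle)                      # pooled cash amounts
set.seed(123)
perm <- replicate(10000, {
  idx <- sample(length(pool), length(SMU))   # indices randomly relabeled as "SMU"
  mean(pool[idx]) - mean(pool[-idx])         # permuted difference in means
})
mean(abs(perm) >= abs(obs))                  # two-sided permutation p-value, roughly 0.15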
Code 4.4. Two Tailed permutation test in R, using manually input groups
school1 <- rep('SMU', 16)
school2 <- rep('Seattle', 14)
school <- as.factor(c(school1, school2))
all.money <- data.frame(name = school, money = c(SMU, Seattle))

t.test(money ~ name, data = all.money)
number_of_permutations <- 1000
xbarholder <- numeric(0)
counter <- 0
observed_diff <- mean(subset(all.money, name == "SMU")$money) - mean(subset(all.money, name == "Seattle")$money)

set.seed(123)
for (i in 1:number_of_permutations) {
  scramble <- sample(all.money$money, 30)
  smu <- scramble[1:16]
  seattle <- scramble[17:30]
  diff <- mean(smu) - mean(seattle)
  xbarholder[i] <- diff
  if (abs(diff) > abs(observed_diff)) counter <- counter + 1
}
hist(xbarholder, xlab = 'Permuted SMU - Seattle', main = 'Histogram of Permuted Mean Differences')
box()
pvalue <- counter / number_of_permutations
pvalue
observed_diff

Chapter 5
Unit 1 Lecture Slides
This chapter collects the Unit 1 lecture slides (MSDS 6371, Lecture 1: Drawing Statistical Conclusions). The key points of the deck are summarized below.

• Notation: the sample mean, standard deviation, and variance (x̄, s, s²) are point estimates of the corresponding population quantities (µ, σ, σ²).
• Two distinctions determine the scope of a study's conclusions: randomized experiment vs. observational study, and random sample vs. self-selection.
• Creativity study (intrinsic vs. extrinsic motivation): subjects volunteered, then treatments were randomly assigned — a randomized experiment on a self-selected sample. Starting-salary study (female vs. male bank employees): neither the units nor the group assignments were random — an observational study.
• Causal inferences can be drawn from randomized experiments but not from observational studies, because of confounding. A confounding variable is related both to group membership and to the outcome; in the salary study, possible confounders include education, seniority, age, and willingness to negotiate a starting salary. In a randomized experiment, variables like age are themselves randomly distributed across groups, removing the confounding effect. (Related example: since 2000 the U.S. median wage has increased about 1% overall while decreasing for high school dropouts and high school graduates — not a paradox, because more people are going to college.)
• Observational studies are still worth doing when establishing causation is not the goal (e.g., predicting whether an email is spam), when randomization would be unethical (assigning cancer-trial subjects to a placebo), when it is scientifically arguable that a confounder is unlikely (the six-month smoking ban in Helena, MT coinciding with a 40% reduction in heart attacks), or when a dataset is incidentally observed anyway (Walmart collects petabytes of data per day).
• Inference to a population can be drawn from a random sample of that population; it cannot be drawn from self-selected units, for which inference extends only to the subjects in the sample. A random sample selects experimental units via a chance mechanism from a well-defined population (e.g., randomly dialed phone numbers — though the population a phone book reaches differs between 1950, 1990, and today); a simple random sample makes every subset of size n equally likely. Neither the creativity study (volunteers) nor the bank study (the entire staff) used random sampling, so no inference to a larger population is possible, interesting as the results may be.
• Practice with scope, Q1: freshmen and seniors were randomly selected from an enumeration of the whole school and their economics scores compared; seniors scored significantly higher. Students cannot be randomly assigned to a class year, so this is an observational study and the difference is only associated with freshman/senior status; because the sample was random, the finding generalizes to all freshmen and seniors in the school.
• Practice with scope, Q2: the Navy recruited 18 to 35 year old nonsmoking volunteers by radio advertisement and assigned them to 36 hours of sleep deprivation or to a control group by coin flip; the deprived group had a significantly lower mean math score. The randomization supports a causal conclusion (sleep deprivation caused the decrease in measured cognitive ability), but the volunteers are not a random sample, so the conclusion applies only to those 57 individuals.
• Measuring uncertainty: in a four-subject version of the creativity study (Bob 12 and Sue 17 intrinsic; Dan 5 and Sal 15 extrinsic), the observed difference in means is 14.5 − 10 = 4.5. Reallocating the four subjects to two groups in all six possible ways gives differences of ±4.5, ±2.5, and ±7.5; four of the six are as extreme as the observed 4.5, so the p-value is 4/6 ≈ 0.667. This answers how unusual the observed statistic would be if the treatments had the same effect.
• For all 47 subjects there are about 1.6 × 10^13 possible regroupings, so roughly 1000 random relabelings are used instead. The observed difference of 4.14 lies far in the tails of that null distribution, and the pooled two-sample t-test agrees: there is strong evidence that the mean score of those who receive intrinsic motivation is not equal to that of those who receive extrinsic motivation (p-value = 0.008), and the one-sided version supports the claim that the intrinsic mean is higher. Since this was a randomized experiment, the intrinsic motivation caused the increase; since the subjects were volunteers, the inference applies only to these 47 subjects.
• When the groups were randomly assigned, this re-randomization procedure is called a randomization test; applied to observational groups, the same idea is called a permutation test.
• Age-discrimination example (American Samoa Government; ages of fired and not-fired workers, sampled at random from all government workers): against 1000 relabelings the observed difference of about 1.92 years is unremarkable, and the two-sided p-value is 0.204. There is not sufficient evidence to suggest that the mean age of those who were fired is different from the mean age of those who were not fired. The p-value is so high that even the null hypothesis of a one-sided test cannot be rejected (there is insufficient evidence to claim that the mean age of fired employees is greater than that of not-fired employees). Since this was a random sample of government employees in Samoa, the inference generalizes to all government-employed people in Samoa; and since we fail to reject the null hypothesis, there is no need to discuss causation or association.
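To make the slides' by-hand enumeration concrete, here is a small R sketch that reproduces the four-subject creativity example (scores as given on the slide); it enumerates every possible regrouping with combn() and recovers the p-value of 4/6:

scores <- c(Bob = 12, Sue = 17, Dan = 5, Sal = 15)
groupings <- combn(4, 2)   # every way to place two of the four subjects in group 1
diffs <- apply(groupings, 2, function(idx) mean(scores[idx]) - mean(scores[-idx]))
obs <- mean(scores[c(1, 2)]) - mean(scores[-c(1, 2)])   # actual assignment: Bob and Sue intrinsic
mean(abs(diffs) >= abs(obs))   # 4 of the 6 regroupings are as extreme, so p = 0.667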
Part II
Inferences Using the t-distribution

Chapter 6
Problem 1: A one sample t test

Question 1
The world's smallest mammal is the bumblebee bat, also known as the Kitti's hog nosed bat. Such bats are roughly the size of a large bumblebee! Listed below are weights (in grams) from a sample of these bats. Test the claim that these bats come from the same population having a mean weight equal to 1.8 g. (Beware: This data is NOT the same as in the lecture slides!)
Sample: 1.7 1.6 1.5 2.0 2.3 1.6 1.6 1.8 1.5 1.7 1.2 1.4 1.6 1.6 1.6
1. Perform a complete analysis using SAS. Use the six step hypothesis test with a conclusion that includes a statistical conclusion, a confidence interval and a scope of inference (as best as can be done with the information above ... there are many correct answers given the vagueness of the description of the sampling mechanism.)
2. Inspect and run this R code and compare the results (t statistic, p-value and confidence interval) to those you found in SAS. To run the code, simply copy and paste the code below into R.

Code 6.1. One sample t test in R with manual data input
sample = c(1.7, 1.6, 1.5, 2.0, 2.3, 1.6, 1.6, 1.8, 1.5, 1.7, 1.2, 1.4, 1.6, 1.6, 1.6)
t.test(x=sample, mu = 1.8, conf.int = "TRUE", alternative = "two.sided")

Answer
6.1 Complete Analysis

Hypothesis definition
H0: µ = 1.8
H1: µ ≠ 1.8

Identification of a critical value and drawing a shaded t distribution
We have that n = 15, so df = n − 1 = 14, with α = 0.05. We input this into SAS and get our shaded distribution and critical value with the code in Code 6.2; this gives us a critical t value of ±2.14479, as seen in Figure 6.1.1.

Figure 6.1.1. Critical t value

Code 6.2. Critical value and two sided shaded t distribution using SAS
data critval;
p = quantile("T",.975,14); /*two sided test*/;
proc print data=critval;
run;
data pdf;
do x = -4 to 4 by .001;
pdf = pdf("T", x, 14);
if x <= quantile("T",.025,14) then lower = pdf; else lower = 0;
if x >= quantile("T",.975,14) then upper = pdf; else upper = 0;
output;
end;
run;
title 'Shaded t distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x lower = lower upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;

Value of Test Statistic
The t statistic was calculated using the following SAS code:

Code 6.3. One sample t test in SAS
proc ttest data=bats h0=1.8 sides=2 alpha=0.05;
run;

By hand, t = (x̄ − µ0)/(s/√n) ≈ (1.65 − 1.8)/(0.25/√15) ≈ −2.35.

P value
This gives us a p-value of p = 0.0342.

Assessment of the Hypothesis test
From here we can see that p = 0.0342 < α = 0.05, indicating that we REJECT the null hypothesis, which claims that µ = 1.8.

Conclusion and scope of inference
We cannot say that this sample of bats comes from a population with a mean weight of 1.8 grams (p-value = 0.0342 from a two sided t test).
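The critical value, test statistic, and p-value above can also be cross-checked in a few lines of base R (a quick sketch; the weights are those listed in the problem):

bats <- c(1.7, 1.6, 1.5, 2.0, 2.3, 1.6, 1.6, 1.8, 1.5, 1.7, 1.2, 1.4, 1.6, 1.6, 1.6)
n <- length(bats)
qt(0.975, df = n - 1)                                  # critical value, about +/- 2.1448
(t_stat <- (mean(bats) - 1.8) / (sd(bats) / sqrt(n)))  # about -2.35
2 * pt(-abs(t_stat), df = n - 1)                       # two-sided p-value, about 0.034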
Below is a graph produced with the code from step 4 which shoes a 95% confidence interval on the distribution of the data (green) vs the null hypothesis(gray bar) 25 Analysis Guide Midterm The mean of 1.8 lies outside the reasonable range of the data from the sample, and as our hypothesis test showed, vice versa is also true. We cannot say that our sample of bats has a mean weight of 1.8, and it is difficult to say that it came from a population of mean 1.8. However, we cannot make any conclusions about the population this sample came from, because it is not a random sample (we also clearly cant make any causal inferences), We only know, with 95% confidence, that our sample does not have a mean of 1.8 grams, and that is about all we can say. Some R code Code 6.4. one sample t test in r 1 2 3 4 sample <- c(1.7, 1.6, 1.5, 2.0, 2.3, 1.6, 1.6, 1.8, 1.5, 1.7, 1.2, 1.4, 1.6, 1.6, 1.6) t.test(x=sample , mu = 1.8, conf.int = "TRUE", alternative = "two.sided") 26 Chapter 7 Problem 2: Two sample one sided t test Question 2. In the United States, it is illegal to discriminate against people based on various attributes. One example is age. An active lawsuit, filed August 30, 2011, in the Los Angeles District Office is a case against the American Samoa Government for systematic age discrimination by preferentially firing older workers. Though the data and details are currently sealed, suppose that a random sample of the ages of fired and not fired people in the American Samoa Government are listed below: Fired 34 37 37 38 41 42 43 44 44 45 45 45 46 48 49 53 53 54 54 55 56 Not fired 27 33 36 37 38 38 39 42 42 43 43 44 44 44 45 45 45 45 46 46 47 47 48 48 49 49 51 51 52 54 a. Perform a permutation test to test the claim that there is age discrimination. Provide the Ho and Ha, the p-value, and full statistical conclusion, including the scope (inference on population and causal inference). Note: this was an example in Live Session 1. You may start from scratch or use the sample code and PowerPoints from Live Session 1. b. Now run a two sample t-test appropriate for this scientific problem. (Use SAS.) (Note: we may not have talked much about a two-sided versus a one-sided test. If you would like to read the discussion on pg. 44 (Statistical Sleuth), you can run a one-sided test if it seems appropriate. Otherwise, just run a two-sided test as in class. There are also examples in the Statistics Bridge Course.) Be sure to include all six steps, a statistical conclusion, and scope of inference. c. Compare this p-value to the randomized p-value found in the previous sub-question. d. The jury wants to see a range of plausible values for the difference in means between the fired and not fired groups. Provide them with a confidence interval for the difference of means and an interpretation. f. Inspect and run this R Code and compare the results (t statistic, p-value, and confidence interval) to those you found in SAS. To run the code, simply copy and paste the code below into R. Answers 7.1 Permutation test First, a permutation test is ran using n = 9999, using the code I wrote in homework one, inspired by [2]. The code used to run the permutation test is shown below: In this scenario, we have that: Code 7.1. 
A one sided permutation test in SAS obsdiff = mean(G1) - mean(G2); /*G1 and G2 represent the two groups*/ print obsdiff; call randseed(12345); /* set random number seed */ alldata = G1 // G2; /* stack data in a single vector */ N1 = nrow(G1); N = N1 + nrow(G2); NRepl = 9999; /* number of permutations */ nulldist = j(NRepl,1); /* allocate vector to hold results */ do k = 1 to NRepl; x = sample(alldata, N, "WOR"); /* permute the data */ nulldist[k] = mean(x[1:N1]) - mean(x[(N1+1):N]); /* difference of means */ end; title "Histogram of Null Distribution"; refline = "refline " + char(obsdiff) + " / axis=x lineattrs=(color=red);"; call Histogram(nulldist) other=refline; pval = (1 + sum(abs(nulldist) >= (obsdiff))) / (NRepl+1); print pval; H0 :µf − µuf ≤ 0 H1 :µf − µuf > 0 27 Analysis Guide Midterm where the null hypothesis is that the average age of the unfired individuals is the same as the average age of the fired individuals, and the alternative is that the average age of the individuals who were fired is higher. The results of the permutation test are as follows: In the above figure, the red line represents the mean of the difference between the two samples, and the rest of the bars represent our null distribution. SAS tells us that the P-value is 0.2812, meaning 28.12 percent of the null distribution is greater than our sample mean. Therefore, with a 5%, or even a 10% confidence interval, we cannot reject the null hypothesis. We cannot say whether or not there was age discrimination in the firing of workers with the given sample. With this procedure, we can make generalizations about the population, and generalize about all of the government-employed people in Samoa, as we did a random sample, however, we cannot make causal inferences, as there may be confounding variables in the system, and we did not run a randomized experiment. There is also no need to discuss causal problems, because we failed to reject the null hypothesis. 7.2 Two sample T test, full analysis This time we will conduct a t test on the two data sets to determine whether age discrimination occured or not. Because we believe the older workers may have been fired, we are going to perform a one sided t-test. Hypothesis definition First we construct our hypotheses: H0 :µf − µuf ≤ 0 H1 :µf − µuf > 0 critval and distribution Next we draw and shade our distribution: In a two sample t-test, we have that: df = nf + nnf − 2 where in our case, df = 21 + 30 − 2 = 49, α = 0.05 Now we input this information into SAS to draw our distribution[1]: Giving us this lovely graph: Next we find a number for the critical value, using the same code as problem 1: 28 Analysis Guide Midterm Code 7.2. One sided shaded t distribution in SAS and Critval data pdf; do x = -4 to 4 by .01; pdf = pdf("T", x, 49); lower = 0; if x >= quantile("T",0.95,49) then upper = pdf;/*one sided*/ output; end; run; title ’Shaded t distribution’; proc sgplot data=pdf noautolegend noborder; yaxis display=none; band x = x lower = lower upper = upper / fillattrs=(color=gray8a); series x = x y = pdf / lineattrs = (color = black); series x = x y = lower / lineattrs = (color = black); run; else upper = 0; data critval; p = quantile("T",.95,49); /*one sided test*/; proc print data=critval; run; This gives us a critical t value of 1.67655. Calculation of the T statistic Next we calculate our two sample t statistic using SAS: Code 7.3. 
Two sample t test using SAS
proc ttest data=samoa alpha=.05 test=diff sides=U;
class fired;
var age;
run;

Which tells us that our t statistic is 1.10.

P value
With the code from the previous step, we also see the p value: p = 0.1385.

Hypothesis assessment
p = 0.1385 > α = 0.05 for the one tailed hypothesis test, indicating that we CANNOT REJECT the null hypothesis.

Conclusion
The p value for the t test was about half of the p value for the permutation test; I believe this is because I ran a one-sided t test. It is interesting to note that if you do a two sided t-test in SAS, you get roughly the same value for p as in the permutation test. This means that maybe a permutation test is a good estimator of the two-sided t-test. We cannot reject the null hypothesis, meaning we cannot say that older workers were fired from the Samoan government. Note that we used a one tailed hypothesis test in this scenario, as we wanted to determine if the fired group was OLDER than the non-fired group. As a result of this test, we cannot say that the fired group was older than the unfired group, and since this sample was random, we can say the same thing about the entire Samoan government. However, we cannot make causal inferences, and there is no need to because we did not reject the null hypothesis.
We can provide a lot of confidence intervals for the jury. I think the most telling is the one sided confidence interval, which would tell us what difference in the means constitutes age discrimination. This was produced using the following SAS code:
proc ttest data=samoa alpha=.05 test=diff sides=U; /*an upper tailed test*/
class fired;
var age;
run;
which gives us a confidence interval of [−1.0107, ∞). This confidence interval represents the upper difference of means at a 95% confidence level. We can interpret this as follows: if the confidence interval contains the null hypothesis, then we cannot reject it; however, if it does not contain the null hypothesis, we must reject it. As we can see in this beautifully drawn figure, the null hypothesis, µf − µnf ≤ 0, is contained within our CI. This means we cannot reject the null hypothesis; we cannot say there was age discrimination. It is plausible that the mean difference for the entire population of Samoan government employees is less than or equal to zero, as it is within the 95% confidence interval, which means we cannot, as objective jurors, claim there was age discrimination.

Incorrect calculations
The pooled sample standard deviation, sp, is defined as
sp = √[ Σᵢ₌₁ᵏ (nᵢ − 1)sᵢ² / Σᵢ₌₁ᵏ (nᵢ − 1) ],
which for us is
sp = √[ ((21 − 1)(6.5214)² + (30 − 1)(5.8835)²) / (20 + 29) ] = 6.152.
The equation for the standard error of the difference of means is given as
σx̄₁−x̄₂ = √( s₁²/n₁ + s₂²/n₂ ),
which gives us
σx̄₁−x̄₂ = √( 6.5214²/21 + 5.8835²/30 ) = 1.811.
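The pooled standard deviation above can be reproduced in R in a few lines (a sketch; the age vectors are the ones given in the problem):

fired <- c(34, 37, 37, 38, 41, 42, 43, 44, 44, 45, 45, 45, 46, 48, 49, 53, 53, 54, 54, 55, 56)
not_fired <- c(27, 33, 36, 37, 38, 38, 39, 42, 42, 43, 43, 44, 44, 44, 45, 45, 45, 45, 46, 46, 47, 47, 48, 48, 49, 49, 51, 51, 52, 54)
n1 <- length(fired); n2 <- length(not_fired)
sqrt(((n1 - 1) * var(fired) + (n2 - 1) * var(not_fired)) / (n1 + n2 - 2))  # pooled SD, about 6.15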
7.3 R code
The following code (supplied in the homework) was put into R; the output follows the listing.

Code 7.4. Two sample t test in R
Fired = c(34, 37, 37, 38, 41, 42, 43, 44, 44, 45, 45, 45, 46, 48, 49, 53, 53, 54, 54, 55, 56)
Not_fired = c(27, 33, 36, 37, 38, 38, 39, 42, 42, 43, 43, 44, 44, 44, 45, 45, 45, 45, 46, 46, 47, 47, 48, 48, 49, 49, 51, 51, 52, 54)
t.test(x = Fired, y = Not_fired, conf.int = .95, var.equal = TRUE, alternative = "greater")

        Two Sample t-test
data:  Fired and Not_fired
t = 1.0991, df = 49, p-value = 0.1385
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -1.010728       Inf
sample estimates:
mean of x mean of y
 45.85714  43.93333

The results are nearly identical; I cannot tell which one is better (though I imagine R may be slightly more accurate), and the difference between the results is very small in all regards. The var.equal = TRUE argument is important because it requests the pooled test.

Chapter 8
Problem 3: Two sample two sided t test

Question 3.
In the last homework, it was mentioned that a Business Stats professor here at SMU polled his class and asked students how much money (cash) they had in their pockets at that very moment. The idea was that we wanted to see if there was evidence that those in charge of the vending machines should include the expensive bill / coin acceptor or if it should just have the credit card reader. However, another professor from Seattle University was asked to poll her class with the same question. Below are the results of our polls.
SMU 34, 1200, 23, 50, 60, 50, 0, 0, 30, 89, 0, 300, 400, 20, 10, 0
Seattle U 20, 10, 5, 0, 30, 50, 0, 100, 110, 0, 40, 10, 3, 0
a. Run a two sample t-test to test if the mean amount of pocket cash from students at SMU is different than that of students from Seattle University. Write up a complete analysis: all 6 steps including a statistical conclusion and scope of inference (similar to the one from the PowerPoint). (This should include identifying the Ho and Ha as well as the p-value.) Also include the appropriate confidence interval. FUTURE DATA SCIENTIST'S CHOICE!: YOU MAY USE SAS OR R TO DO THIS PROBLEM!
b. Compare the p-value from this test with the one you found from the permutation test from last week. Provide a short 2 to 3 sentence discussion on your thoughts as to why they are the same or different.

Answer
8.1 Full Analysis

Hypothesis Definition
Hypothesis set up:
H0: µ1 − µ2 = 0
H1: µ1 − µ2 ≠ 0

Critical value and shaded distribution
Next we draw and shade our distribution. In a two sample t-test, we have that df = n1 + n2 − 2, where in our case df = 16 + 14 − 2 = 28 and α = 0.05. In this case we are performing a two tailed test.
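The corresponding two-sided critical value can be sanity-checked in R with one line (a sketch):

qt(0.975, df = 28)   # about 2.048, matching the +/-2.04841 reported below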
Now we input this information into SAS to draw our distribution[1]: data pdf; do x = -4 to 4 by .001; pdf = pdf("T", x, 14); /*here it is important to set up a two sided test*/ if x <= quantile("T",.025,28) then lower = pdf; else lower = 0; if x >= quantile("T",.975,28) then upper = pdf; else upper = 0; output; end; run; title ’Shaded t distribution’; proc sgplot data=pdf noautolegend noborder; yaxis display=none; band x = x lower = lower upper = upper / fillattrs=(color=gray8a); series x = x y = pdf / lineattrs = (color = black); series x = x y = lower / lineattrs = (color = black); run; With this bit of code, we have produced our shaded two tailed PDF: 32 Analysis Guide Midterm This critical value, where the bands start, is calculated using the following SAS code: data critval; p = quantile("T",.975,28); /*two sided test*/; proc print data=critval; run; This gives us a critical t value of ±2.04841 T statistic the t stat is calculated using the following code: Code 8.1. Two sided two sample t test in SAS proc ttest data=wallet alpha=.05 test=diff sides=2; /*an upper tailed test*/ class school; var cash; run; which tells us that our t statistic is −1.37 P value With the code from the previous step, we also see the p value, p = 0.1812: Hypothesis Assessment p = 0.1812 > α = 0.05 for the one tailed hypothesis test, indicating that we CANNOT REJECT the null hypothesis Conclusion and Scope of inference We cannot reject the null hypothesis, meaning we cannot say that the mean amount of cash in an SMU student’s wallet is any different than the mean amount of cash in a Seattle U student’s wallet. The following figure is a good reference for the results of this test: 33 Analysis Guide Midterm The circled area tells us the difference between the mean amount of cash in a Seattle student’s wallet and an SMU student’s wallet. We can see that the average student from the seattle sample had about 112 dollars less in his wallet than the average SMU student. This may sound like a lot, however it is not significant. For this result to be statistically significant, and the mean amount of cash in a Seattle U student’s wallet to be considered different than the mean amount of cash in an SMU student’s wallet, the difference of the two means would have to fall outside of the 95% confidence interval. The confidence interval is highlighted, and is (−281.2, 55.6817), which tells us that for the means to be considered truly different, the seattle student should have either 281 dollars less than the SMU student, or 55 dollars more. Our p value of 0.1812 tells us a similar story. It tells us that there is an 18% chance that a greater difference in the means would occur, which, at a 5 or 10 percent confidence interval, is not statistically significant at all. As for scope of inference, we cannot make inferences about the greater population of either university, because these were not random samples. We also cannot make causal inferences (eg going to SMU makes you have money in your wallet!), as this is not a randomized experiment either. Something about outliers! 34 Chapter 9 Problem 4: power Question 4. A. Calculate the estimate of the pooled standard deviation from the Samoan discrimination problem. Use this estimate to build a power curve. Assume we would like to be able to detect effect sizes between 0.5 and 2 and we would like to calculate the sample size required to have a test that has a power of .8. Simply cut and paste your power curve and SAS code. HINT: USE THE CODE FROM DR. McGEE’s lecture. 
Instead of using groupstddevs, use stddev since we are using the pooled estimate. B. Now suppose we decided that we may be able to live with slightly less power if it means savings in sample size. Provide the same plot as above but this time calculate curves of sample size (y-axis) vs. effect size (.5 to 2) (x axis) for power = 0.8, 0.7, and 0.6. There should be three plots on your final plot. Simply cut and paste your power curve and SAS code. HINT: USE THE CODE FROM DR. McGEE’s lecture. Instead of using groupstddevs, use stddev since we are using the pooled estimate. The effect size here refers to a difference in means, though there are many effect size metrics, such a Cohen’s D. C. Using similar code, estimate the savings in sample size from a test aimed at detecting an effect size of 0.8 with a power of 80% versus a power of 60%. Note: You will learn how to do this in R in a future HW! Answers 9.1 Single power curve he pooled standard deviation, calculated in Problem 2, part e, part 1, is sp = 6.5215. The difference of the means of the two groups, meandiff in the code, is just set to the difference between the means of our two populations, calculated using the R-generated means in Problem 2, Part f, µf − µuf = 1.924. The value of meandiff is not important, because by plotting the effect size, we are cycling through mean differences between 0.5 and 6, so the meandiff parameter only really matters if you want to know a sample size for a specific difference of means. When building a power curve it is not important at all, but you need it to get proc power to work. The SAS code used to build the power curve is shown below: Code 9.1. Proc power single with pooled variance proc power; twosamplemeans /*test=diff not diffsatt bc pooled variance*/ test=diff stddev=6.5215 /*meandiff is a dummy variable in this case*/ meandiff=1.924 power=.8 ntotal = .; plot x=effect min=.5 max=6; run; And the power curve: 35 Analysis Guide 9.2 Midterm Multiple power curves The same notes as above apply here, this time we used the SAS code to generate multiple power curves: Code 9.2. Producing several curves with proc power proc power; twosamplemeans /*test=diff not diffsatt bc pooled variance*/ test=diff stddev=6.5215 /*meandiff is a dummy variable in this case*/ meandiff=1.924 power=.8 .7 .6 ntotal = .; plot x=effect min=.5 max=6; run; And the curves: 9.3 Calculating change in N It is important to remember that the “effect size” calculated in this SAS code is the exact same thing as the “mean difference”. Therefore we can write our SAS code as follows: proc power; twosamplemeans test=diff /*diff not diffsatt bc pooled variance*/ stddev=6.5215 meandiff= 0.8 /*this represents the effect size*/ power=.8 .6 ntotal = .; run; Which gives us our sample size savings: 36 Analysis Guide Midterm As we see from the figure above, by raising the power from 0.6 to 0.8, we actually have to nearly double the sample size to meet the test parameters. By using a power of 0.6, we save 784 N’s (or sample size units) 37 Chapter 10 Unit 2 Lecture Slides 38 10/13/2018 Inference Using t-Distributions Central Limit Theorem M E A S U R I N G U N C E R TA I N T Y I N R A N D O M I Z E D A N D O B S E RVAT I O N A L STUDIES - D I S T R I B U T I O N O F T H E S A M P L E AV E R A G E -USING T-DISTRIBUTION FOR ONE SAMPLE INFERENCE - S TA R T I N G T O E X P LO R E T - D I S T R I B U T I O N F O R T W O S A M P L E P R O B L E M S 1 2 Distribution of Sample Average Distribution of Sample Average is unbiased. 
is a point estimate for µ The sample mean is an unbiased estimator for the population mean. 3 The more data you pick for each sample, the more normal (and tighter) the distribution of the sample mean is. Note that the distribution of the original data is the distribution of a sample mean of size 1. 4 The more data you pick for each sample, the more normal (and tighter) the distribution of the sample mean is. If original data is approx. normal, then the distribution of the sample mean will be approx. normal, regardless of sample size. µ http://onlinestatbook.com/stat_sim/sampling_dist/ 1 10/13/2018 Value (x) Trial X1 4 3.5 X2 3 3.5 X3 3 Dice: Individual Rolls (n = 1) 1 6 … … Frequency X5 1000 4 2 2000 3 1500 5.5 800 Dice: Sample Means of Size n = 2 Frequency Trial 1000 500 … 600 … 400 0 1 1.5 2 3.5 200 2.5 3 3.5 4 4.5 Average of 2 Dice 0 X6000 1 5 2 3 4 5 5.5 6 4 5 6 Roll of the Die Trial Trial 3.6 3.3 3 3.1 Dice: Sample Means of Size n = 5 2 2.9 1800 3 Dice: Sample Means of Size n = 10 4.3 1600 1400 … 3.1 1000 800 … 600 400 … 4.2 200 1 0 1 … 1.5 2 2.5 3 4.2 3.5 4 4.5 3.7 5 5.5 6 1600 1400 1200 1000 800 600 400 200 0 1 Average of 5 Dice 3.4 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 Average of 10 Dice THE CENTRAL LIMIT THEOREM!!! Dice: Individual Rolls (n = 1) CENTRAL LIMIT THEOREM Cont. 0 2 3 4 5 6 Roll of the Die Dice: Sample Means of Size n = 2 2000 1500 1000 500 0 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 Average of 2 Dice Dice: Sample Means of Size n = 10 Frequency 1 Frequency Frequency … 1800 Frequency Frequency 5.5 … 2000 1200 2000 1000 0 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 Average of 10 Dice 2 10/13/2018 T-ratio is unbiased est. for Example: If we have data 79, 83, 84, 89, 90 mm for digitus tertius (the human middle finger). What is an estimate of the standard deviation? *This ratio HAS a t – distribution if Y is normally distributed. 13 Student t Distributions for n = 3 and n = 12 14 Example: 1 Sample Confidence Interval Student t distributions have the same general shape and symmetry as the standard normal distribution but reflect a greater variability (heavier tails), which is expected with small samples. The following are ages of 7 randomly selected patrons at the Beach Comber in South Mission Beach at 7pm. We assume that the data come from a normal distribution and would like to build a 95% confidence interval for the actual mean age of patrons at the Comber. 25, 19, 37, 29, 40, 28, 31 William Sealy Gosset (Student) n=7 = 29.86 σ = 7.08 = 0.05 /2 = 0.025 z/2 = 1.96 x – E < µ < x + E, where E = z2 σ = (1.96)(7.08) = 5.24 n 29.86 – 5.24 < µ IMPORTANT: These are the 7 plausible values < 29.86 + 5.24 of the mean given the data! 24.62 < < 35.10 We are 95% confident that the mean age of Beach Comber patrons at 7pm is contained in any 95% confidence interval, such as (24.62 years, 35.10 years). n=7 = 29.86 s = 7.08 x – E < µ < x + E, where E = t2, n-1 s = 0.05 /2 = 0.025 t/2, n-1 = 2.447 n 29.86 – 6.55 < µ = (2.447)(7.08) = 6.55 IMPORTANT: These are the plausible values < 29.86 + 6.55 of the mean given the data! 7 23.31 < < 36.41 We are 95% confident that the mean age of Beach Comber patrons at 7pm is contained any 95% confidence interval, such as (23.31 yrs., 36.41 yrs.). 
3 10/13/2018 Comparison of z to t E=z σ 1 Sample Hypothesis Testing: The 6 Steps = (1.96)(7.08) = 5.24 2 n=7 = 29.86 n 7 σ = 7.08 We are 95% confident that x – E < µ < x + E = 0.05 the mean age of Beach /2 = 0.025 Comber patrons at 7pm is 29.86 – 5.24 < < 29.86 + 5.24 µ contained in the interval z/2 = 1.96 24.62 < < 35.10 (24.62 years, 35.10 years). 23.31 24.62 E = t2, n-1 s n=7 = 29.86 s = 7.08 = 0.05 /2 = 0.025 = n 29.86 – 6.55 < µ < 29.86 + 6.55 t/2, n-1 = 2.447 23.31 < < 36.41 35.10 36.41 (2.447)(7.08) = 6.55 7 We are 95% confident that the mean age of Beach Comber patrons at 7pm is contained in the interval (23.31 years, 36.41 years). 1. Identify Ho and Ha. 2. Find the Critical Value(s) and Draw and Shade. 3. Calculate the Test – Statistic. (The evidence!) 4. Calculate the P-value. 5. Make a decision… Reject Ho or FTR Ho. 6. Write a clear conclusion in the context of the problem…. Use mostly non statistical terms but always report the p-value! Add a confidence interval if appropriate. End this conclusion with a statement about the scope. 20 Example: 1 Sample t-test Let’s Formalize This Test Into 6 Steps! Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. The following are ages of 7 randomly chosen patrons seen leaving the Beach Comber in South Mission Beach at 7pm. We assume that the data come from a normal distribution and would like to test the claim that the mean age of the distribution of Comber patrons is different than 21. 25, 19, 37, 29, 40, 28, 31 Let’s Formalize This Test Into 6 Steps! Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Step 2: Draw and Shade and Find the Critical Value. Let’s Formalize This Test Into 6 Steps! Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Step 2: Draw and Shade and Find the Critical Value. df = 7 – 1 = 6 df = 7 – 1 = 6 21 t 21 t Step 3: Find the test statistic. (The t value for the data.) 4 10/13/2018 Let’s Formalize This Test Into 6 Steps! Let’s Formalize This Test Into 6 Steps! Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Step 2: Draw and Shade and Find the Critical Value. Step 2: Draw and Shade and Find the Critical Value. Step 3: Find the test statistic. (The t value for the data.) Step 3: Find the test statistic. (The t value for the data.) Step 4: Find the p-value: The probability of observing by random chance something as extreme or more extreme than what was observed under the assumption that the null hypothesis is true. (Usually found with software.) The red shaded region above is 0.0162 (sum of both red areas) Step 4: Find the p-value: P-value 0.0162< .05 Step 5: Key! The sample mean we found is very unusual under the assumption that the true mean age is 21. So we Reject the assumption that the true mean age is 21. That is, we REJECT Ho. Let’s Formalize This Test Into 6 Steps! Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Finding the P-value – more detail Step 2: Draw and Shade and Find the Critical Value. Step 4: Find the p-value: p-value < .05 You could use Stat Trek / or the t-table. Confidence interval OR Step 3: Find the test statistic. (The t value for the data.) Software like SAS: Step 4: Find the p-value: P-value 0.0162 < .05 Step 5: REJECT Ho Step 6: There is sufficient evidence to conclude that the true mean age of patrons at the Comber at 7pm is not equal to 21 (p-value =0.0162 from a t-test). 
We could also say that there is sufficient evidence to conclude that the true mean is greater than 21. (Consider the red area in the right most tail.) This was not a random sample of all times, only at 7pm; thus, the result cannot be applied to the bar at all times. The results are nevertheless intriguing. 28 One-Sided Test + Two-Sided CI Demonstration One-Sided Test + Two-Sided CI Demonstration 29 30 5 10/13/2018 One-Sided Test + Two-Sided CI Demonstration One-Sided Test + Two-Sided CI Demonstration Suppose we would like to test the claim that the mean age of patrons is Suppose we would like to test the claim that the mean age of patrons is greater than 24. greater than 24. One Sided-Test at alpha = 0.05 Two Sided-Test at alpha = 0.05 Two Sided-Test at alpha = 0.1 Two Sided-Test at alpha = 0.05 31 32 One-Sided Test + Two-Sided CI Demonstration TWO SAMPLE T-TEST FOR THE DIFFERENCE OF MEANS WITH INDEPENDENT SAMPLES Perform a two sample t-test for the difference in the mean score between the Intrinsic and Extrinsic groups from the chapter problem. Provide a complete analysis, including a full conclusion, confidence interval, and scope of inference. Use an alpha = .01 level of significance. 33 Let’s Formalize This Test Into 6 Steps! Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. 34 Let’s Formalize This Test Into 6 Steps! Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Step 2: Draw and Shade and Find the Critical Value. df = 24 +23 – 2 = 45 Which is equivalent to: 0 t 6 10/13/2018 Let’s Formalize This Test Into 6 Steps! Let’s Formalize This Test Into 6 Steps! Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Step 2: Draw and Shade and Find the Critical Value. Step 2: Draw and Shade and Find the Critical Value. df = 24 +23 – 2 = 45 0 t Step 3: Find the test statistic. (The t value for the data.) Step 4: Find the p-value: The probability of observing by random chance something as extreme or more extreme than what was observed under the assumption that the null hypothesis is true. (Usually found with software.) The red shaded regions above. 0.0054 Step 3: Find the test statistic. (The t value for the data.) Let’s Formalize This Test Into 6 Steps! Let’s Formalize This Test Into 6 Steps! Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Step 2: Draw and Shade and Find the Critical Value. Step 2: Draw and Shade and Find the Critical Value. Step 3: Find the test statistic. (The t value for the data.) Step 3: Find the test statistic. (The t value for the data.) Step 4: Find the p-value: P-value 0.0054< .01 Step 5: REJECT Ho Step 6: There is sufficient evidence to suggest that those who receive the Intrinsic treatment have a different mean score than those who receive the Extrinsic treatment (p-value = .0054 from a t-test). We can also claim that the mean intrinsic score is greater than the extrinsic one. (The burden of rejecting the null hypothesis for a one-tailed test is less than a two-tailed test, given the test is in the relevant direction.) A 99% confidence interval for this difference is (.3347, 7.95). Since this was a randomized experiment, we can conclude that the Intrinsic treatment caused this difference. However, since the study was of volunteers (sampling bias), this inference can only be generalized to the 47 participants. 
Step 4: Find the p-value: P-value 0.0054< 0.01 COMPARE WITH RANDOMIZATION (PERMUTATION) TEST Finding the P-value Step 4: Find the p-value: P-value < .01 4.14 -4.14 1000 different groupings (relabelings) You could use Stat Trek / or the t-table. OR Software like SAS: Obs Variable Class Method Variances Mean Low erCLMean UpperCLMean StdDev Low erCLStdDev UpperCLStdDev UMPULow erCLSt dDev UMPUUpperCLSt dDev 1 COL139 Diff (1-2) Pooled Equal 4.4678 1.6594 7.2762 4.7786 3.9635 6.0187 3.9360 5.9708 2 COL170 Diff (1-2) Pooled Equal -4.3192 -7.1485 -1.4899 4.8141 3.9930 6.0634 3.9653 6.0152 3 COL279 Diff (1-2) Pooled Equal -4.5576 -7.3530 -1.7623 4.7564 3.9451 5.9908 3.9178 5.9430 4 COL360 Diff (1-2) Pooled Equal -4.8897 -7.6340 -2.1454 4.6695 3.8731 5.8814 3.8462 5.8345 5 COL537 Diff (1-2) Pooled Equal 4.3826 1.5621 7.2031 4.7991 3.9806 6.0446 3.9530 5.9964 6 COL551 Diff (1-2) Pooled Equal -5.0514 -7.7692 -2.3337 4.6243 3.8356 5.8245 3.8090 5.7781 7 COL604 Diff (1-2) Pooled Equal -4.7109 -7.4832 -1.9385 4.7172 3.9127 5.9415 3.8855 5.8942 8 COL664 Diff (1-2) Pooled Equal 4.6636 1.8840 7.4431 4.7295 3.9228 5.9569 3.8956 5.9095 There is strong evidence to suggest that the mean score of those who receive intrinsic motivation is not equal to those who receive the extrinsic motivation (p-value = .008). The burden to reject the null hypothesis is lower under a one-sided test, so we can say that the evidence supports the claim that the intrinsic mean is higher than the extrinsic mean. Since this was a randomized experiment, we can conclude that the intrinsic motivation caused this increase. In addition, since these were volunteers, this inference can only be assumed to apply to these 47 subjects, although the findings are very intriguing. 41 7 10/13/2018 Let’s Talk Power!!! Explore power! Here is an applet that will show you what happens to the power/beta when you change the sample size, alpha, standard deviation, or effect size (measure of the difference between null mean and actual (alternative) mean). http://shiny.stat.tamu.edu:3838/eykolo/power/ Effect size basically measures the difference between the population mean (106) and the null mean(100). (It’s not exactly this, though.) 43 44 Pick all that are true. The power increases when: (Go to break out) Consider the following options. A. The probability of rejecting Ho when the null is true. A. The sample size decreases. B. The probability of accepting Ho when the null is true. B. The sample size increases. C. The probability of rejecting Ho when the null is false. C. The standard deviation / standard error decreases. D. The probability of FTR Ho when the null is true. D. The effect size increases. E. The probability of FTR Ho when the null is false. E. The effect size decreases. C WHICH IS POWER? ___ A WHICH IS ALPHA? ___ E WHICH IS BETA? ___ 45 Pick all that are true. The power increases when: 46 Appendix A. The sample size decreases. B. The sample size increases. C. The standard deviation / standard error decreases. D. The effect size increases. E. The effect size decreases. 47 48 8 10/13/2018 Distribution of Sample Average ANOTHER EXAMPLE FOR PRACTICE 49 50 H0: = 1.8 H1: ≠ 1.8 = 0.05 x = 1.713 s = .2588 H0: = 1.8 H1: ≠ 1.8 = 0.05 x = 1.713 s = .2588 Critical Values t = ± 2.145 On the basis of this test, there is not enough evidence to reject the claim that the mean weight of bumblebee bats is equal to 1.8g (p-value = .2155 from a t-test). A 95% confidence interval is (1.57 g, 1.8566 g). 
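The dice demonstration referenced above is easy to reproduce. The sketch below is a small base-R version (the 6000 simulated trials are an arbitrary choice, not a number taken from the slides) showing how the histogram of sample means tightens and becomes bell-shaped as more dice are averaged.

set.seed(1)
par(mfrow = c(2, 2))
for (n in c(1, 2, 5, 10)) {
  # simulate many averages of n fair dice and plot their distribution
  means <- replicate(6000, mean(sample(1:6, n, replace = TRUE)))
  hist(means, main = paste("Sample means of", n, "dice"),
       xlab = "Average of the dice")
}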
Part III

A Closer look at Assumptions

Chapter 11

Problem 1: Two Sample T test with assumptions

Question 1. In the United States, it is illegal to discriminate against people based on various attributes. One example is age. An active lawsuit, filed August 30, 2011, in the Los Angeles District Office is a case against the American Samoa Government for systematic age discrimination by preferentially firing older workers. Though the data and details are currently sealed, suppose that a random sample of the ages of fired and not fired people in the American Samoa Government are listed below:

Fired: 34 37 37 38 41 42 43 44 44 45 45 45 46 48 49 53 53 54 54 55 56
Not fired: 27 33 36 37 38 38 39 42 42 43 43 44 44 44 45 45 45 45 46 46 47 47 48 48 49 49 51 51 52 54

a. Check the assumptions (with SAS) of the two-sample t-test with respect to this data. Address each assumption individually as we did in the videos and live session, and make sure to copy and paste the histograms, q-q plots, or any other graphics you use (boxplots, etc.) to defend your written explanation. Do you feel that the t-test is appropriate?
b. Check the assumptions with R and compare them with the plots from SAS.
c. Now perform a complete analysis of the data. You may use either the permutation test from HW 1 or the t-test from HW 2 (copy and paste) depending on your answer to part a. In your analysis, be sure to cover all the steps of a complete analysis:
1. State the problem.
2. Address the assumptions of the t-test (from part a).
3. Perform the t-test if it is appropriate and a permutation test if it is not (judging from your analysis of the assumptions).
4. Provide a conclusion including the p-value and a confidence interval.
5. Provide the scope of inference.

Answer

11.1 Complete Analysis

Assumption checking in SAS

The assumptions were tested using proc ttest, which outputs histograms, box plots, Q-Q plots, and performs an F-test on the variances. The code used to produce all of the output in this section is shown below:

Code 11.1. Checking the assumptions of a t test in SAS

proc ttest data=samoa alpha=.05 test=diff sides=U; /*an upper tailed test*/
class fired;
var age;
run;

Normality

The normality of the data is checked using a Q-Q plot, a boxplot, and a histogram. First we examine the Q-Q plot:

Figure 11.1.1. Q-Q Plot for Normality

In Figure 11.1.1, the y-axis shows the sample quantiles and the x-axis the theoretical normal quantiles. The reference line shows what a perfectly normal data set would look like: a one-to-one relationship between the data and the theoretical normal quantiles. Both samples follow the line closely, so on visual inspection we can say both samples are consistent with normality. We can double-check this using Figure 11.1.2, a histogram and boxplot:

Figure 11.1.2. Histogram and Boxplot for Normality

It is a bit harder to assess normality from the histogram and boxplot, but SAS overlays kernel density lines on the histogram (the red line is the data and the blue line is a normal density). The data loosely follow the normal curve; there are small departures, but the fit is reasonably close.
The box plot tells the same story: in both groups the mean is very near the median (in a normal distribution the mean and median coincide), with only slight left and right skewing, so overall we can assume the data are normal.

Equal Variances

To assess the equality of the variances visually, we can again use the histogram and boxplot, this time displayed in Figure 11.1.3 (repeated here for ease of grading):

Figure 11.1.3. Histogram and Boxplot for Variance Equality

The bounds of the histograms show that the range of each data set is roughly the same, with the means roughly centered, which suggests the two groups have nearly equal variances. The box plot confirms this: the distance from the center of each box to the far left and far right whiskers is about the same for both groups. This is further supported by the F-test for equal variances, whose results are displayed below:

Figure 11.1.4. F Test for Equal Variances

The F-test is reasonable here because the data appear normal and the samples are of moderate size (21 and 30). Its p-value is about 0.60, meaning that under equal population variances a variance ratio at least as extreme as the observed one would occur about 60% of the time, so at a significance level of 5, 10, 15, or even 20 percent we fail to reject equality of variances. Therefore, we can assume equal variances.

Independence

We can assume independence; the two groups do not relate to each other, and any dependence that might exist we will assume away for the sake of the problem.

Conclusion

In my opinion, the t-test is appropriate for this data set, because all of the assumptions are reasonably met.

Assumption Checking in R

Normality test

To test for normality, we again use the Q-Q plot and the histogram. The following code produces the Q-Q plots:

Code 11.2. t test Assumption checking in R, Q-Q plot

#producing adjacent Q-Q plots
par(mfrow=c(1,2))
qqnorm(Fired, main="Normal Q-Q Plot for Fired data",
       xlab = "Normal Quantiles", ylab = "Fired Quantiles")
qqnorm(Not_fired, main="Normal Q-Q Plot for Not Fired data",
       xlab = "Normal Quantiles", ylab = "Not Fired Quantiles")

The plots produced are shown below:

Figure 11.1.5. Q-Q plots for Normality in R

From the near-linearity of the points in this figure, we can see that the data follow a more or less normal distribution. The Q-Q plot produced in R is almost exactly the same as the one from SAS, except that by default it lacks the reference line representing perfect normality, and the plot dimensions and aspect ratio change with the window size, which is a bit of a pain. The following code produces histograms to examine normality further:

Code 11.3. t test Assumption checking in R, Histogram

#producing the adjacent histograms
par(mfrow=c(1,2))
hist(Fired)
hist(Not_fired)

This produces the following figure:

Figure 11.1.6. Histogram for Normality in R

As the figure shows, the distributions of the two data sets are again more or less normal, with the mean and median apparently near the center; there is a bit of a bump in the fired data, but it is still loosely normal in appearance. The graphs look much the same as the SAS versions apart from formatting differences, and the numbers are easier to read off the R plots. In this case, we can ASSUME NORMALITY.
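As a small optional addition (not part of the original homework code), base R's qqline() supplies the reference line whose absence is noted above, using the same Fired and Not_fired vectors:

par(mfrow = c(1, 2))
qqnorm(Fired, main = "Normal Q-Q Plot for Fired data")
qqline(Fired)        # reference line through the first and third quartiles
qqnorm(Not_fired, main = "Normal Q-Q Plot for Not Fired data")
qqline(Not_fired)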
Equality of Variances

Looking at the histograms in Figure 11.1.6, the fired group is centered around 45 years and spans roughly 30 to 60, while the not-fired group is centered around 40 years and spans roughly 25 to 55. The spread of the two groups is more or less the same, so we can ASSUME EQUAL VARIANCES.

Independence

We can again assume independence.

Conclusion: The t-test is appropriate.

Complete Analysis:

Problem statement: We would like to test the claim that the mean age of the individuals who were fired is greater than the mean age of the individuals who were not fired.

Assumptions: We can assume normality, independence, and equal variances, as shown in parts a and b, and therefore we can use the pooled two-sample t-test.

t-test

Statement of the Hypotheses:

H0: µf − µnf ≤ 0
H1: µf − µnf > 0

Shaded Distribution and Critical Values: For a pooled two-sample t-test, df = nf + nnf − 2, which in our case is df = 21 + 30 − 2 = 49, with α = 0.05. Now we input this information into SAS to draw our distribution [1]:

data pdf;
do x = -4 to 4 by .01;
pdf = pdf("T", x, 49);
lower = 0;
if x >= quantile("T",0.95,49) then upper = pdf; /*one sided, alpha = 0.05*/
else upper = 0;
output;
end;
run;
title 'Shaded t distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x lower = lower upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;

Giving us this lovely graph:

Next we find the critical value itself, using the same code as Problem 1:

data critval;
p = quantile("T",.95,49); /*one sided test*/
proc print data=critval;
run;

This gives us a critical t value of 1.67655.

Calculation of t statistic: Next we calculate our two-sample t statistic using SAS:

proc ttest data=samoa alpha=.05 test=diff sides=U;
class fired;
var age;
run;

Which tells us that our t statistic is 1.10.

Calculation of P-value

The output from the previous step also shows the p-value: p = 0.1385.

Discussion of the Null Hypothesis

p = 0.1385 > α = 0.05 for this one-tailed test, so we CANNOT REJECT the null hypothesis.

Conclusion: We cannot reject the null hypothesis, meaning we cannot conclude that older workers were preferentially fired from the Samoan government. Note that we used a one-tailed test here because we wanted to determine whether the fired group was OLDER than the not-fired group. A one-sided p-value of 0.1385 means that, if there were truly no difference, there would be roughly a 14% chance of observing a difference in mean ages at least this large; at the α = .05 (5%) significance level, the data fail to reject the null hypothesis. Using the code that calculated the t statistic, we also obtain the one-sided confidence interval [−1.0107, ∞), the 95% interval for µf − µnf (a lower bound on the difference). We can interpret it as follows: if the interval contains the null value of zero, we cannot reject the null hypothesis; if it does not, we must reject it. As we can see in this beautifully drawn figure, the null hypothesis µf − µnf ≤ 0 is compatible with values inside our confidence interval, so we cannot reject the null hypothesis and cannot say there was age discrimination.
It is plausible that the mean difference for the entire population of Samoan government employees is less than or equal to zero, since such values lie within the 95% confidence interval, which means we cannot, as objective jurors, claim there was age discrimination.

Scope of Inference: Since this was a random sample, we can generalize to the Samoan government as a whole; however, we cannot make causal inferences, as this was not a randomized experiment.
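As a rough cross-check of the SAS results (not part of the original write-up), the same pooled, upper-tailed test can be run in R on the Fired and Not_fired vectors from part b, and the critical value can be verified with qt():

# One-sided pooled two-sample t-test, mirroring proc ttest with sides=U
t.test(Fired, Not_fired, alternative = "greater", var.equal = TRUE)
qt(0.95, df = 49)    # critical value, about 1.6766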
Chapter 12

Outliers and Logarithmic Transformations

The permutation test was performed using the code in Code 12.1. We then perform the same assumption checks and comparisons without the outlier; unless otherwise noted, Code 12.2 was used to remove the outlier and produce those results.

Code 12.1. Automatically input permutation test in SAS

/*Permutation test*/
data Wallet;
INFILE 'file location';
INPUT school $ cash;
run;
proc iml;
use Wallet var {school cash};
/*making two groups in IML*/
read all var {cash} where(school='SMU') into g1;
read all var {cash} where(school='SEU') into g2;
obsdiff = mean(g1) - mean(g2);
print obsdiff;
call randseed(12345); /* set random number seed */
alldata = g1 // g2; /* stack data in a single vector */
N1 = nrow(g1);
N = N1 + nrow(g2);
NRepl = 9999; /* number of permutations */
nulldist = j(NRepl,1); /* allocate vector to hold results */
do k = 1 to NRepl;
x = sample(alldata, N, "WOR"); /* permute the data */
nulldist[k] = mean(x[1:N1]) - mean(x[(N1+1):N]); /* difference of means */
end;
title "Histogram of Null Distribution";
refline = "refline " + char(obsdiff) + " / axis=x lineattrs=(color=red);";
call Histogram(nulldist) other=refline;
pval = (1 + sum(abs(nulldist) >= abs(obsdiff))) / (NRepl+1); /*two sided test*/
print pval;
run;

Code 12.2. Outlier removal in SAS

data Wallet;
INFILE 'file location';
INPUT school $ cash;
run;
data CleanCash;
set Wallet;
/*we are going to remove all the really high values*/
if cash > 150 then delete;
run;
proc ttest data=CleanCash alpha=.05 test=diff sides=2; /*a 2 tailed test*/
class school;
var cash;
run;

Chapter 13

Log Transformed data

13.1 Full Analysis

Problem Statement: We would like to test the claim that the distribution of incomes for those who have 16 years of education is greater than that of those who have 12 years of education.

Assumptions

We first produce the plots for our assumption analysis with the following code:

proc import /*to use proc import, first we specify the file*/
datafile='genericfilepath/genericname.csv'
/*then we specify the name of the output dataset*/
out=edudata
/*then we specify the data type*/
dbms=CSV;
run;
proc sort data=edudata;
by descending educ;
run;
proc ttest data=edudata order=DATA /*keeps the group order we just set*/
sides=U; /*an upper tailed test*/
class Educ;
var Income2005;
run;

Producing the following figures:

Figure 13.1.1. Q-Q plot of sample

Figure 13.1.2. Histogram and Boxplot of the sample

Normality assumption: Looking at the Q-Q plot (Figure 13.1.1), it is clear that the data are not normal at all. To investigate further, we look at the histograms and box plots in Figure 13.1.2. These paint a more complete picture: the data are skewed to the right, and the highest incomes dwarf the lowest by several orders of magnitude.

To combat this, let's perform a natural log transformation and see what the data look like:

Code 13.1. log transform in SAS

data edudata2;
set edudata;
lincome=log(Income2005);
run;
proc ttest data=edudata2 order=DATA sides=U; /*an upper tailed test*/
class Educ;
var lincome;
run;

Producing the following figures:

Figure 13.1.3. Q-Q plot of logs

Figure 13.1.4. Histogram and Boxplot of Logs

With this transformation, the Q-Q plot (Figure 13.1.3) shows that the data are now mostly normal. The histograms (Figure 13.1.4) confirm this, both in their shape and in the shape of the kernel density curves; the nearness of the median to the mean is another telltale sign of normality. Therefore, we can assume the log-transformed data are normal.

Equality of Variances

Since we cannot assume normality for the untransformed data, there is little point in analyzing the equality of its variances; we assess the log-transformed data instead. Looking at Figure 13.1.4, the spread of the two groups is quite similar: the histograms cover ranges of similar width, with the 12-year group slightly narrower than the 16-year group. The boxplot confirms this; the distance from the center of each box to the ends of the whiskers, and the IQRs themselves, are roughly the same for both groups, and the group with the larger mean also has the slightly larger variance. Therefore, we can assume the log-transformed data have equal variances.

Independence

We can assume the data are independent in this scenario.

Hypothesis testing

We will use a one-tailed pooled t-test on the log-transformed data.

Statement of Hypotheses: Because we are running a pooled t-test on a log transformation, the conclusions are about medians rather than means; the medians tell us whether the income distribution of people with 16 years of education exceeds that of people with 12 years of education.

H0: Median16 = Median12
H1: Median16 > Median12

which here is equivalent to

H0: distribution16 = distribution12
H1: distribution16 > distribution12

Critical Value

In this scenario, α = 0.1 and df = 1424. From that we can shade a one-sided distribution and find a critical value using the code below:

data pdf;
do x = -4 to 4 by .01;
pdf = pdf("T", x, 1424);
lower = 0;
if x >= quantile("T",0.9,1424) then upper = pdf; /*one sided*/
else upper = 0;
output;
end;
run;
title 'Shaded t distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x lower = lower upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;
data critval;
p = quantile("T",.9,1424); /*one sided test*/
proc print data=critval;
run;

This produces the shaded distribution:

Figure 13.1.5. Shaded t distribution

and a critical value of t = 1.28215.

Calculation of the t statistic: Using the proc ttest call from Code 13.1, we find that t = 10.98, which is an astoundingly large value.

Calculation of the p-value: p < 0.0001, as shown in the output above.

Discussion of the Null Hypothesis

We REJECT the null hypothesis, since p ≈ 0 < 0.1 = α.
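The critical value and p-value can also be spot-checked in R; this is only a sketch using base functions, not part of the original analysis:

qt(0.90, df = 1424)                       # critical value, about 1.2822
pt(10.98, df = 1424, lower.tail = FALSE)  # one-sided p-value, far below 0.0001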
Conclusion

We reject the null hypothesis that the two distributions are equal: we have convincing evidence that the income distribution of people with 16 years of education is greater than that of people with 12 years. With a one-sided p-value of essentially zero, the distributions are clearly different, and the median income of people with 16 years of education is evidently greater than the median income of people with 12 years. The figure below shows the difference between the natural logarithms of the two medians. It tells us that the median income of people with 16 years of education is e^0.5699 = 1.77 times that of people with 12 years of education. A 90% confidence interval for this multiplicative effect is 1.62 to 1.93 times.

We cannot make causal inferences in this scenario, as there was no randomized experiment, and we cannot make population inferences either, as there was no random sampling.
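For readers working in R, the back-transformation can be sketched as below; the data frame name edu is hypothetical, standing in for the imported data with columns Educ (12 or 16) and Income2005 as in the SAS code above.

# Pooled t-test on log incomes, then exponentiate to recover the
# multiplicative effect on the median and its 90% interval.
edu$Educ <- relevel(factor(edu$Educ), ref = "16")   # put the 16-year group first
fit <- t.test(log(Income2005) ~ Educ, data = edu,
              var.equal = TRUE, conf.level = 0.90)
exp(fit$estimate[1] - fit$estimate[2])   # multiplicative effect, about 1.77
exp(fit$conf.int)                        # about 1.62 to 1.93 times
# (for the one-sided test itself, add alternative = "greater")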
11 12 2 10/13/2018 Given Data, How Do We Check the Normality Assumption? Visually! n = 100 Given Data, How Do We Check the Normality Assumption? Visually! n = 100 Histogram n = 15 q-q Plot n = 15 Histogram Not normal! Data is skewed to the right and does not fall along a straight line in this q-q 13 plot. q-q Plot Data comes from a normal distribution, but it is hard to tell given the small sample size. 14 Given Data, How Do We Check the Normality Assumption? Visually! Beware of small sample sizes! n = 15 n = 15 n = 15 Histogram n = 15 q-q Plot It looks like the data might not be normal (skew, curvature of q-q plot), but it is hard to tell with this small sample size. Histogram 15 The histogram shows an almost bimodal distribution (definitely not normal), but again it is 16 hard to tell with small sample sizes. The q-q plot does not look too far away from normality. A Way to Decide: Small Sample Size Large Sample Size Little to no Evidence Against Normality No Problem if you feel Normality is a safe assumption … run the TTest. (You may want to be “conservative” here and run a test with fewer assumptions.) No Problem! Run the T-Test Significant Evidence Against Normality Assumptions are not met and test is not robust here … Try a transformation and, if appropriate, run a t-test. If not appropriate, do NOT run the T-Test and proceed to a test with fewer / different assumptions. No Problem .. You have the Central Limit Theorem. Run the TTest. q-q Plot A Complete Analysis: Statement of the Problem Address the Assumptions Perform the Appropriate Test (5 Steps) Step 6: Provide a conclusion that a non statistician can understand, include a p-value and confidence interval. • Scope of Inference • • • • 17 18 3 10/13/2018 Example: Beach Comber Example: Comber The following are ages of 7 randomly chosen patrons seen leaving the Beach Comber in South Mission Beach at 7pm! We assume that the data come from a normal distribution and would like to test the claim that the mean age of the distribution of Comber patrons is different than 21. 25, 19, 37, 29, 40, 28, 31 PROBLEM STATEMENT: Test the claim that the mean age of Beach Comber patrons at 7pm is different from 21. ASSUMPTIONS: Normal Population Distribution: Judging from the histogram and q-q plots, there is little to no evidence that the population distribution of patron ages at the Comber at 7pm is not normal. We will assume that this distribution is normal and proceed. Independence: These subjects were randomly selected from the population; thus, we will assume that the observations are independent. 19 20 Revised Write Up! Example: Bats We would like to test the claim that the population mean is different from 21. To do this, we take a sample of size n = 7 and find that 𝑥̅ = 29.86 years and s = 7.09 years. Ho: µ = 21 Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Ha: µ ≠ 21 Step 2: Draw and Shade and Find the Critical Value. 𝑡= Step 3: Find the test statistic. (The t value for the data.) Step 4: Find the p-value: P-value = .0162 < .05 29.86 − 21 𝑥̅ − 𝜇 𝑠 = 7.09 𝑛 7 = 𝟑. 𝟑𝟏 Step 5: REJECT Ho Step 6: There is sufficient evidence to conclude that the true mean age of patrons at the Comber at 7pm is different from 21 (p-value =.0162 from a t-test). A 95% confidence interval for the mean age is (23.3, 36.4) years. Scope: Since this was a random sample, we can generalize these findings to the entire population of Comber patrons at 7pm. Note that we have evidence to support the claim that the mean age is greater than 21 as well. 
22 Example: Bats H0: 𝝁= 1.8 H1: 𝝁 ≠ 1.8 𝜶 = 0.05 𝒙= 1.713 s = .2588 PROBLEM STATEMENT: Test the claim that the mean weight of the bumble bee bat is different from 1.8 g. ASSUMPTIONS: Normal Population Distribution: Judging from the histogram and q-q plots, there is some visual evidence of a departure from normality. With a sample size of 15 and no extreme outliers, we will assume the distribution of sample means is decently approximated by a normal distribution via the CLT and proceed with caution. Independence: Not much is known about the sampling scheme used to obtain this sample. We will assume the observations are independent. Critical Values Test Statistic t = -1.297 t = ± 2.145 P-value: .2155 > .05 Fail to Reject H0 On the basis of this test, there is not enough evidence to reject the claim that the mean weight of bumblebee bats is equal to 1.8 g (p-value = .2155 from a t-test). A 95% confidence interval is (1.57, 1.8566) grams. The problem was ambiguous on the randomness of the sample; thus, we will assume that it was not a random sample, which makes inference to all bats strictly speculative. 23 24 4 10/13/2018 Assumptions of one and two sample T-Tests What happens if the normality assumption is broken? Many times …. NO PROBLEM!!! 1. Samples are drawn from a normally distributed population. 2. If it is a two sample test, both populations are assumed to have the same standard deviation (same shape). 3. The observations in the sample are independent of one another. x x 𝐶𝑒𝑛𝑡𝑎𝑙 𝐿𝑖𝑚𝑖𝑡 𝑇ℎ𝑒𝑜𝑟𝑒𝑚 𝑥̅ 𝑥̅ 𝜇 𝜇 25 When data is not normal 26 2. In a two sample test, both populations are assumed to have the same standard deviation (same shape). 𝐴𝑠𝑠𝑢𝑚𝑒: 𝜎 = 𝜎 𝜇 𝜇 𝑊𝑒 𝑤𝑎𝑛𝑡 𝑖𝑛𝑓𝑒𝑟𝑒𝑛𝑐𝑒 𝑜𝑛 ∶ 𝜇 − 𝜇 27 Evidence of Inequality of Variance: VISUAL 28 Evidence of Inequality of Variance: F-Test for Equal Variance Ho: population variances are equal Ha: population variances are not equal There is not sufficient evidence to conclude the variances are different (p-value = .4289 from a F-Test.) Little visual evidence against equal standard deviations (variances). 29 30 5 10/13/2018 Evidence of Inequality of Variance: VISUAL Evidence of Inequality of Variance: F-Test for Equal Variance Ho: population variances are equal Ha: population variances are not equal There is not sufficient evidence to conclude the variances are different (p-value = .1043 from a F-Test.) Strong visual evidence against equal standard deviations (variances). 31 Evidence of Inequality of Variance: F-Test / VISUAL The F-test has a strong assumption that the two populations that it is testing the variances of must be normal. It is not robust to this assumption. Since the second distribution has strong evidence of right skew, the F-test for Equal Variance is not appropriate here. For this example, the visual evidence is so strong that we would not need to consult a hypothesis test to test this assumption of equal variances. 32 What happens if the assumption of equal variances (standard deviations) is broken? In some circumstances …. This could be serious …. In others….. No Problem! However, later in the semester we will study a test of spread/dispersion that does not have this assumption and can be used in a wider range of statistical environments. 33 When variances are not equal 34 The Take Away What you will find in practice will most likely not fit exactly into the scenarios identified here. There will be some judgment involved … this is the “art” of statistics. Here are some general rules of thumb that we will assume this semester. 1. 
If sample sizes are the same and sufficiently large, the t tools (tests and confidence intervals) are valid … since they are robust to the violation of normality. 2. If the two populations have the same standard deviation, then the t tests are valid … given sufficient sample sizes. 3. If the standard deviations are different and the sample sizes are different then the t tools are not valid and another procedure should be used. (Ch. 4) 35 36 6 10/13/2018 FULL EXAMPLE: CREATIVITY STUDY! A Complete Analysis: We would like to test the claim that the mean score of the Intrinsic group is different than that of the Extrinsic group. To do this we take a sample of size nI = 24 and nE = 23 and find that 𝑥̅ I = 19.88 points, 𝑥̅ E = 15.74, sI = 4.44, and sE= 5.25 points. Statement of the Problem Address the Assumptions Perform the Appropriate Test (5 Steps) Step 6: Provide a conclusion that a non statistician can understand. Include a p-value and confidence interval • Scope of Inference • • • • Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Ho: µ𝐼 = µ𝐸 Ha: µ𝐼 ≠ µ𝐸 Which is equivalent to: Ho: µ𝐼 − µ𝐸 = 0 Ha: µ𝐼 − µ𝐸 ≠0 37 Full Example: Creativity Data First Check …. q-q Plot State the Problem: We would like to test the claim that the mean score of the Intrinsic group is different than that of the Extrinsic group. Check Assumptions: 1. Normally Distributed Populations The q-q plots for both populations look sufficiently normal. We look at the histograms as well … but there is not sufficient evidence here to suggest that they are not normal. 39 Histograms • Keeping in mind the relative small sample size from each population, we do not observe any extreme outliers and observe a pretty strong bell shape which lends evidence to support normality of the populations. 41 40 Normality Assumption Visual inspection of the histograms and q-q plots of each population are consistent with the normality of each population. We assume normality and move on to the second assumption. 42 7 10/13/2018 Equality of Variances Full Example: Creativity Data State the Problem: We would like to test the claim that the mean score of those with intrinsic motivation is the same for those with extrinsic motivation. Check Assumptions: 1. Normally Distributed Populations 2. Equal Standard Deviations A visual check was done by looking at the histograms, which reveal similar shapes and support the equal variances assumption. You can assume equal variances here. 43 Full Example: Creativity Data Since we are able to assume normal population distributions, we can use the F-Test to provide secondary evidence if the visual is inconclusive. Since the p-value is greater than our significance level of alpha = 0.05, we fail to reject the null hypothesis of equality (p-value = 44 0.1043) and conclude that there is not enough evidence to suggest the variances are different. Independent Observations State the Problem: We would like to test the claim that the mean score of those with intrinsic motivation is the same for those with extrinsic motivation. Check Assumptions: 1. Normally Distributed Populations 2. Equal Standard Deviations 3. Independent Observations The sample consisted of volunteers and thus subjects may not be independent of one another. However, we will assume independence and proceed with caution. 45 46 Let’s Formalize This Test Into 6 Steps! Full Example: Creativity Data State the Problem: We would like to test the claim that the mean intrinsic score is the same as the extrinsic score. Check Assumptions: 1. 
Normally Distributed Populations 2. Equal Standard Deviations 3. Independent Observations Run the Test: 1. First 5 steps. We would like to test the claim that the mean score of the Intrinsic group is different than that of the Extrinsic group. To do this we take a sample of size nI = 24 and nE = 23 and find that 𝑥̅ I = 19.88 points, 𝑥̅ E = 15.74, sI = 4.44, and sE= 5.25 points. Ho: µ𝐼 − µ𝐸 = 0 Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Ha: µ𝐼 − µ𝐸 ≠0 Step 2: Draw and Shade and Find the Critical Value. Step 3: Find the test statistic. (The t value for the data.) 𝑡= (𝑥𝐼 − 𝑥𝐸 ) 𝑠𝑝 1 + 1 𝑛𝐼 𝑛𝐸 = 2.93 Step 4: Find the p-value: P-value 0.0054< .01 Step 5: Key! The sample mean we found is very unusual under the assumption that the group means are equal (µ𝐼 − µ𝐸 ). So we Reject this assumption. That is, we REJECT Ho. 47 8 10/13/2018 Let’s Fill in the P-value (and add a CI)! We would like to test the claim that the mean score of the Intrinsic group is different than that of the Extrinsic group. To do this we take a sample of size nI = 24 and nE = 23 and find that 𝑥̅ I = 19.88 points, 𝑥̅ E = 15.74, sI = 4.44, and sE= 5.25 points. Ho: µ − µ = 0 Full Example: Creativity Data 𝐼 State the Problem: We would like to test the claim that the mean intrinsic score is the same as the extrinsic score. Check Assumptions: 1. Normally Distributed Populations 2. Equal Standard Deviations 3. Independent Observations Run the Test: 1. First 5 steps. State the Scope and Conclusion. 49 𝐸 Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Ha: µ𝐼 − µ𝐸 ≠0 Step 2: Draw and Shade and Find the Critical Value. Step 3: Find the test statistic. (The t value for the data.) Step 4: Find the p-value: P-value = .0054 Step 5: REJECT Ho 𝑡= (𝑥𝐼 − 𝑥𝐸 ) 𝑠𝑝 1 + 1 𝑛𝐼 𝑛𝐸 = 2.93 Step 6: Conclusion: There is sufficient evidence to suggest that those who receive the Intrinsic treatment have a higher mean score than those who receive the Extrinsic treatment (p-value = .0054 from a two sided t-test). A 99% confidence interval for this difference is (1.29, 7.00). SCOPE: Since this was a randomized experiment, we can conclude that the Intrinsic treatment caused this difference. However, since the study was of volunteers, this inference can only be generalized to the 47 participants. Happiness Study LET’S TRY SOME! For each of these data sets, write up the assumption statement with respect to checking the assumptions for a one or two sample t-test. You may assume the data to be independent. Happiness Data Set Mice Experiment Data Set 5 randomly selected people were asked to rate their happiness on a scale from 1 – 100 on a cloudy day and 8 randomly selected people were asked the same question on a sunny day. QOI: Is the mean happiness of individuals different on a cloudy day than a sunny day? If possible, can we test if cloudy weather causes a change in happiness? All data sets can be found in one file in this week’s materials. You will need to add the proc ttest statement for each. However, you will not need the data for this exercise. Address each assumption of the two sample t-test and then decide if the two-sample ttest is appropriate to answer this QOI with this data. 51 Happiness Study Mice Study Normality of Distributions: Judging from the histograms and q-q plots, there is evidence of outliers in both the Cloudy and Sunny sets. The most pronounced outlier seems to be in the Sunny data set; thus, there is significant visual evidence against these data being normally distributed. 
In addition, we are not satisfied that the t-test will be robust to this assumption since the sample sized are so small. A large sample of mice were randomly assigned to receive a drug or a placebo (sample size nD = 32 and nP = 32). The mice’s tcell counts were then taken and histograms and q-q plots are displayed above. Equal Standard Deviations: Judging from the histograms, q-q plots and box plots, there is significant visual evidence that the standard deviations are different. In addition, since the sample sizes are different we know that the t-test is not robust to this assumption. Independence: We will assume that these data are independent. The two sample t-test is not appropriate here. We should look for a different test. 52 53 QOI: Is the mean tcell count of mice that receive the drug greater than that of the mice that receive the placebo? Can we draw draw evidence of causality from this study? Address each assumption of the two sample t-test and then decide if the two-sample ttest is appropriate to answer this QOI with this data. 54 9 10/13/2018 Mice Study Normality of Distributions: Judging from the histograms and q-q plots, there is significant visual evidence to suggest the data come from right skewed distributions. However, since the sample size is large nD = 32 and nP = 32 the t-test is robust to this assumption violation. Transformations Equal Standard Deviations: There is strong visual evidence to suggest that the data come from distributions with different standard deviations. However, since we have the same sample size in each group, the t-test is robust to this assumption violation, by a previous “rule of thumb”. Independence: We will assume that these data are independent. The two sample t-test is appropriate here. 55 56 Log Transformation Appropriate Interpretations After a Log Transformation – Example Write Ups…. Observational Study: “It is estimated that the median for population X is exp(mean(log(x)) – mean(log(y))) times as large as the median for population Y.” Randomized Experiment: “It is estimated that the median response of an experimental unit to treatment x will be exp(mean(log(x)) – mean(log(y))) times as large as its response to treatment y.” Cloud Seeding! Does Cloud Seeding Work? On days that were deemed suitable for cloud seeding, a random mechanism was used to decide whether to seed the target cloud on that day or to leave it unseeded as a control. Precipitation was measured as the total rain volume falling from the cloud base following the airplane seeding run, as measured by radar. We would like to test at the alpha = .05 level of significance whether cloud seeding is effective in increasing precipitation. 10 10/13/2018 After Log Transformation Cloud Seeding: Original Data T Test and Confidence!!! H0: Cloud Seeding does not work. H1: Cloud Seeding does work. H0: Medianseeded = Medianunseeded H1: Medianseeded > Medianunseeded e0.3904 = 1.5, e1.8972 = 6.7 Cloud Seeding Book Example Original 0 500 1000 1500 Rainfall (acre-feet) 2000 2500 Figure 1. Box Plots of Cloud Seeding Data. Unseeded Logged Seeded Log(Rainfall) For confidence interval. It is estimated that the median volume of rainfall on days when clouds were seeded was e1.1438=3.1 times as large as when not seeded (p-value = .007). A 90% confidence interval for this multiplicative effect on the median is 1.5 to 6.7 times. 
Since randomization was used to determine whether any particular suitable day was seeded or not, it is safe to interpret this as evidence that the seeding caused the larger median rainfall. Recap: The Take Away 0 2 For the one sided test. 4 6 8 Figure 2. Box Plots of Log-Transformed Cloud Seeding Data. Unseeded Seeded Appendix What you will find in practice will most likely not fit exactly into the scenarios we identified here. There will be some judgment involved … this is the “art” of statistics. Here are some general rules of thumb that we will assume this semester. 1. If sample sizes are the same and sufficiently large, the t tools (tests and confidence intervals) are valid … since they are robust to the violation of normality. 2. If the two populations have the same standard deviation then the t tests are valid … given sufficient sample sizes. 3. If the standard deviations are different and the sample sizes are different then the t tools are not valid and another procedure should be used. (Ch. 4) 65 66 11 10/13/2018 Log Transformations: Theory Prop 1: Log(x) Prop 3: Log(y) Because data is now symmetric (median =mean) Mean[log(x)] = Median[log(x)] The logarithm is a monotonically increasing function. If X1 > X2 then log(X1) > log(X2). Therefore consider X1 through X5 in ascending order so that X1 < X2 < X3 < X4 < X5. Then log(X1) < log(X2) < log(X3) < log(X4) < log(X5). X Log(X) X1 log(X1) X2 log(X2) X3 log(X3) X4 log(X4) X5 log(X5) 𝑒 = Log (base 10) Transformations: Theory Prop 1: Prop 4a: 𝑒 ( ) =𝑋 Derivation: 𝑀𝑒𝑎𝑛(log 𝑋 ) − 𝑀𝑒𝑎𝑛(𝑙𝑜𝑔 𝑌 ) = 𝛿 Diff of means on log scale 𝑀𝑒𝑑𝑖𝑎𝑛(log 𝑋 ) − 𝑀𝑒𝑑𝑖𝑎𝑛(𝑙𝑜𝑔 𝑌) = 𝛿 Prop 1 log 𝑀𝑒𝑑𝑖𝑎𝑛 𝑋 − log(𝑀𝑒𝑑𝑖𝑎𝑛 𝑌) = 𝛿 Prop 2 =𝛿 ( ) ( ) =𝑋 68 𝑋 log 𝑋 − log 𝑌 = log ( ) 𝑌 Prop 2: = ( ) 10 log(Median(X)) = Median(log(X)) log(Median(X)) = Median(log(X)) ( ) ( ) =𝑋 log(Median(X)) = log(X3) = Median(log(X)) Prop 3: Mean[log(x)] = Median[log(x)] Therefore: log 𝑒 =𝑒 Prop 4b: 67 Prop 1: ( ) ( ) 𝑒 ( ) e is a pretty remarkable number!: Log (base e) Transformations: Theory log Prop 4a: 𝑋 log 𝑋 − log 𝑌 = log( ) 𝑌 Mean[log(y)] = Median[log(y)] Prop 2: Log Transformations: Theory y x ( ) ( ) Prop 2: Prop 4b: log(Median(X) = Median(log(X)) 10 ( ) =𝑋 Derivation: 𝑀𝑒𝑎𝑛(log 𝑋 ) − 𝑀𝑒𝑎𝑛(𝑙𝑜𝑔 𝑌 ) = 𝛿 Diff of means on log scale 𝑀𝑒𝑑𝑖𝑎𝑛(log 𝑋 ) − 𝑀𝑒𝑑𝑖𝑎𝑛(𝑙𝑜𝑔 𝑌) = 𝛿 Prop 1 log 𝑀𝑒𝑑𝑖𝑎𝑛 𝑋 − log(𝑀𝑒𝑑𝑖𝑎𝑛 𝑌) = 𝛿 Prop 2 ( ) ( ) log Prop 4a Therefore: log 10 = 10 10 10 = Prop 3 =𝛿 ( ) ( ) ( ) ( ) = ( ) ( ) Prop 4b Full Example: SSHA Data The Survey of Study Habits and Attitudes (SSHA) is a psychological test designed to measure the motivation, study habits, and attitudes toward learning of college students. These factors, along with ability, are important to explain success in school. Scores on the SSHA range from 0 to 200. A selective private college gives the SSGA to an SRS of both male and female first-year students. The data for the women are as follows: 156 109 137 115 152 140 154 178 111 123 126 126 137 165 129 200 150 140 116 120 130 131 130 140 142 117 118 145 130 145 The data for men are as follows: 118 140 114 180 115 126 92 169 139 121 132 75 88 113 151 70 115 187 114 116 117 145 149 150 120 121 117 129 92 110 Most studies have found that the mean SSHA score for men is lower than the mean score in a comparable group of women. Test this claim at the alpha = .05 level of significance. (Show all 6 steps.) 
H0: w = m H1: w > m 𝑋 log 𝑋 − log 𝑌 = log ( ) 𝑌 Prop 3 FULL EXAMPLE: SSHA Data Prop 3: Mean[log(x)] = Median[log(x)] 71 State the Problem: We would like to test the claim that the mean SSHA score of men is less than that of women. Check Assumptions: 1. Normally Distributed Populations 72 12 10/13/2018 First Check …. q-q Plot Histograms The q-q plots for both populations look sufficiently normal. We look at the histograms as well … but there is not sufficient evidence here to suggest that they are not normal. 73 Normality Assumption • Keeping in mind the relative small sample size from each population, we do not observe any extreme outliers and observe a pretty strong bell shape which lends evidence to support normality of the populations. 74 Full Example: SSHA Data State the Problem: We would like to test the claim that the mean SSHA score of men is less than that of women. Check Assumptions: 1. Normally Distributed Populations 2. Equal Standard Deviations Visual inspection of the histograms and q-q plots of each population is consistent with the normality of each population. We assume normality and move on to the second assumption. 75 Equality of Variances A visual check was done by looking at the histograms which reveal similar shapes and support the equal variances assumption. You can assume equal variances here. Since we are able to assume normal population distributions, we can use the F-Test to provide secondary evidence if the visual is inconclusive. Since the p-value is greater than our significance level of alpha = 0.05, we fail to reject the null hypothesis of equality (p-value = 0.1043) of variances and conclude that there is not enough evidence to suggest the variances 77 are different. 76 Full Example: SSHA Data State the Problem: We would like to test the claim that the mean SSHA score of men is less than that of women. Check Assumptions: 1. Normally Distributed Populations 2. Equal Standard Deviations 3. Independent Observations 78 13 10/13/2018 Independent Observations Full Example: SSHA Data State the Problem: We would like to test the claim that the mean SSHA score of men is less than that of women. Check Assumptions: 1. Normally Distributed Populations 2. Equal Standard Deviations 3. Independent Observations Run the Test: 1. First 5 steps. The sample was indeed a SRS (simple random sample) from the population of the selective private college, therefore we assume the observations are independent of one another. 79 80 Run The Two Sample T-Test!!! Critical Value • There is no reason to pair these observations and we have two samples …. Therefore we should use the two sample t-test with pooled standard deviation since we are assuming the population standard deviations are equal. We are testing here: 𝛼 = .05 = significance level. 𝑥̅ W - 𝑥̅ M .05 df = 60 – 2 = 58 0 𝑡. , = 1.67 H0: 𝜇 = 𝜇 H1: 𝜇 > 𝜇 81 82 Let’s Formalize This Test Into 6 Steps! Two Sample T-Test … SAS Output We would like to test the claim that the mean SSHA score of the men is less than the mean score of women. To do this we take a sample of size nM = 30 and nW = 30 and find that 𝑥̅ M = 124.2 points, 𝑥̅ W = 137.1 and sM = 27.2 sW= 20.2 points. Ho: µ𝑊 − µ𝑀 = 0 Step 1: Identify the null (Ho) and alternative (Ha) hypothesis. Ha: µ𝑊 − µ𝑀 > 0 Step 2: Draw and Shade and Find the Critical Value. Step 3: Find the test statistic. (The t value for the data.) 𝑡= (𝑥̅ 𝑊 − 𝑥̅ 𝑀) 𝑠𝑝 1 + 1 𝑛𝐼 𝑛𝐸 = 2.08 Step 4: Find the p-value: P-value = .0211 Step 5: REJECT Ho. 
83 14 10/13/2018 Full Example: SSHA Data Scope State the Problem: We would like to test the claim that the mean SSHA score of men is less than that of women. Check Assumptions: 1. Normally Distributed Populations 2. Equal Standard Deviations 3. Independent Observations Run the Test: 1. First 5 steps. State the Scope and Conclusion. Since the study is between women and men, the subjects cannot be randomly assigned to the two groups, and we have an observational study. For this reason, we cannot make any causal inference and must limit our conclusions to differences of group means. However, the sample was an SRS and thus any results can be inferred back to the population of students at this particular private college. 85 86 Two Sample T-Test … SAS Output Conclusion There is sufficient evidence to support the claim at the α=.05 level of significance (pvalue = .0211) that the mean SSHA score is lower for men than for women at this college. A 95% one side confidence interval for this difference is (2.5238 points, ∞.) Scope of Inference: Since the study is between women and men, the subjects cannot be randomly assigned to the two groups, and we have an observational study. For this reason, we cannot make any causal inference and must limit our conclusions to differences of group means. However, the sample was an SRS, and thus any results can be inferred back to the population of students at this particular private college. 87 88 FULL EXAMPLE: Promotion Data ANOTHER FULL EXAMPLE The Revenue Commissioners in Ireland conducted a contest for promotion. The ages of the unsuccessful and successful applicants are given below. Some of the applicants who were unsuccessful in getting the promotion charged that the competition involved discrimination based on age. Treat the data as samples from larger populations and use a .05 significance level to test the claim that the unsuccessful applicants are from a population with a greater mean age than the mean age of successful applicants. Based on the result, does there appear to be discrimination based on age? (Show all 6 steps.) Assume all data comes from a normally distributed population. Unsuccessful Applicants: 34 37 37 38 45 60 62 55 Successful Applicants 27 33 36 37 43 44 80 46 51 51 89 41 46 56 42 65 70 43 49 64 44 65 44 53 45 54 38 44 47 52 38 44 75 54 39 45 48 42 70 72 42 71 49 43 72 49 H0: U = S H1: S < U 90 15 10/13/2018 Full Example: Promotion Data First Check …. q-q Plot Successful Unsuccessful State the Problem: We would like to test the claim that the mean of the successful group is less than the mean of the unsuccessful group. Check Assumptions: 1. Normally Distributed Populations The q-q plot for the successful data provides some evidence of non normality, while the q-q plot for the unsuccessful data looks consistent with normally distributed data. 91 Histograms • 92 Normality Assumption The successful group (top) has a clear right skew to the data, while the unsuccessful group shows a possible mild right skew. This suggests that both sets of data may be from right skewed populations. We know that the t-tools are robust to non normality for these types of distributions so we proceed with the t test…. We will readdress these concerns when we talk about the standard deviation. 93 Visual Inspection of the histograms and q-q plots indicates the both data sets may be from a right skewed distribution. 
We know that the t-tests are robust to violations of the normality assumption when the data are from a right skewed distribution (when the sample size is sufficient), so we proceed with the t-test. 94 Equality of Variances Full Example: Promotion Data State the Problem: We would like to test the claim that the mean of the successful group is less than the mean of the unsuccessful group. Check Assumptions: 1. Normally Distributed Populations 2. Equal Standard Deviations A visual check was done by looking at the histograms, which reveal similar shapes and support the equal variances assumption. We will assume equal variances here. 95 As secondary evidence of the visual is inconclusive, given that the p-value is greater than our significance level of alpha = 0.05, we fail to reject the null hypothesis of equality of variances (p-value = 0.2286) and conclude that there is not enough evidence to suggest the 96 variances are different. 16 10/13/2018 Full Example: Promotion Data Independent Observations State the Problem: We would like to test the claim that the mean of the successful group is less than the mean of the unsuccessful group. Check Assumptions: 1. Normally Distributed Populations 2. Equal Standard Deviations 3. Independent Observations The sample was indeed a SRS (simple random sample) from the population of the selective private college, therefore we assume the observations are independent of one another. 97 98 Full Example: Promotion Data Run The Two Sample T-Test!!! State the Problem: We would like to test the claim that the mean of the successful group is less than the mean of the unsuccessful group. Check Assumptions: 1. Normally Distributed Populations 2. Equal Standard Deviations 3. Independent Observations Run the Test: 1. First 5 steps. • There is no reason to pair these observations, and we have two samples. Therefore, we should use the two sample t-test with a pooled standard deviation, since we are assuming the population standard deviations are equal. We are testing here: H0: s = u H1: s < u 99 Two Sample T-Test … SAS Output Full Example: Promotion Data State the Problem: We would like to test the claim that the mean of the successful group is less than the mean of the unsuccessful group. Check Assumptions: 1. Normally Distributed Populations 2. Equal Standard Deviations 3. Independent Observations Run the Test: 1. First 5 steps. State the Scope and Conclusion. H0: s = u H1: s < u Fail to reject the null hypothesis at 0.05 level. 100 101 102 17 10/13/2018 SCOPE Conclusion Since the study is between successful and unsuccessful candidates for a promotion, subjects cannot be randomly assigned to the two groups, and we have an observational study. For this reason we cannot make any causal inference and must limit our conclusions to differences of group means. However, the sample was an SRS and thus any results can be inferred back to candidates for promotion from the population that the Revenue Commissioners of Ireland sampled. There is not sufficient evidence to support the claim at the α=.05 level of significance (p-value = .4357) that the mean age of those who were given a promotion is lower than those who were not given the promotion in this . A 90% confidence interval for this difference is (-6.3 points, 5.2 points.) 
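One detail worth making explicit from these two examples: a one-sided test at α = .05 and a two-sided 90% confidence interval are built from the same t quantile, which is why the slide reports them side by side. A small R sketch of that equivalence; the degrees-of-freedom value below is a placeholder chosen only for illustration, since the slide does not state the pooled df for the promotion data:

# One-sided alpha = .05 critical value vs. 90% two-sided CI multiplier.
df <- 51                 # placeholder df, not taken from the slide
qt(0.95, df)             # critical value for the one-sided test
qt(1 - 0.10 / 2, df)     # multiplier for the 90% two-sided CI (identical)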
103 104 18 Part IV Alternatives to the t tools 93 Chapter 15 Problem 2: Logging problem We are doing rank sum analysis 15.1 Complete Rank-Sum Analysis Using SAS Problem Statement We would like to test the claim that logging burned trees increased the percentage of seedlings lost in the Biscuit Fire region from 2004 to 2005. Assumptions Independence The two-sample Wilcoxon Rank-Sum test assumes that the samples are independent. In this case, the two sets of tree plots are independent of each other, the amount of tree seedlings in one plot is not directly related to the amount of tree seedlings in another, if it is, it is not a tangible amount of dependence. Therefore, we can assume independence. We can also assume ordinality with numericla data Statement of the Hypothesis Our null hypothesis, H0 , is that the distribution of percent of saplings lost in the logged plots is less than or equal to the distribution of percent of saplings lost in the unlogged plots. Our alternative hypothesis, H1 , is that the distribution of percent of saplings lost in the logged plots is greater than the distribution of percent of saplings lost in the unlogged plots. Mathematically speaking, we have: H0 :meanRanklogged − meanRankunlogged ≤ 0 (15.1.1) H1 :meanRanklogged − meanRankunlogged > 0 (15.1.2) The significance level, α, is: α = 0.05 (15.1.3) Calculation of the P-value To find the p value, I performed a Wilcoxon Rank-Sum test. Because the sample size is small, an exact test was used, as there is no need for a normal approximation. The code used to perform the test is as follows: Code 15.1. Exact rank sum test using SAS /* We want the wilcoxon test and the Hodges-Lehman Confidence Interval*/ proc NPAR1WAY data=loggingData Wilcoxon HL; class Action; Var PercentLost; /* Because our sample size is small, we want to do an Exact test*/ Exact; run; The output of this code is displayed in Figure 2.1: 94 Analysis Guide Midterm Figure 15.1.1. Results of the Rank-Sum Test on the Logging Data The calculated p value is p = 0.0058 (15.1.4) p = 0.0058 < α = .05 (15.1.5) Results of the Hypothesis Test We have that: Therefore, we Reject the Null Hypothesis There is sufficient evidence at the α = 0.5 significance level (p − value = 0.0058 for the exact test) to suggest that the distribution of percentages of saplings lost in the logged plots was greater than the distribution of percentages of saplings lost. Statistical Conclusion MEDIANS FOR NONPAR The data provides convincing evidence that forest recovery is decreased in areas where burned trees were logged. At a significance level of .05 (or even .01), the distribution/MEDIAN of the percentage of saplings lost in the logged plots was greater than that of the unlogged areas. This was done with a one sided, exact p-value of 0.0058. A range of plausible values (95 % confidence interval) for how much greater the median loss of saplings was for the logged trees is [10.8,65.1], as displayed in Figure 2.2 Figure 15.1.2. 95% Confidence Interval Note that the negative of these values was taken, because this figure shows U nlogged − Logged. Scope of Inference This study was a random sample of trees in the plots, therefore we can make generalizations about all of the trees in the 16 plots, and say that the areas which were logged had a greater loss of saplings and therefore recovered more poorly than the unlogged areas. However, this was not a randomized experiment, and therefore we cannot make causal inferences. 
That is, we cannot say that the logging of burnt trees caused the greater percent loss of saplings. Since the plots were not randomized to receive either the logging or not logging treatment, no causation can be implied here. Since the transect patterns were randomly selected, this inference can be generalized to the 16 larger plots. 95 Analysis Guide Midterm Confirmation Using R In this section we confirm our findings using R. The R code input is shown below: Code 15.2. wilcoxon rank sum test using R 1 2 3 4 5 loggingData <- read.csv("Data/Logging.csv",header=TRUE , sep=",") wilcox.test(PercentLost ~ Action , data = loggingData , exact = TRUE , alternative = "greater") And the output: 1 Wilcoxon rank sum test 2 3 4 5 data: PercentLost by Action W = 55, p-value = 0.005769 alternative hypothesis: true location shift is greater than 0 The results of the two programs are identical! 96 Chapter 16 Problem 3: Welch’s Two Sample T-Test with Education Data 16.1 Problem Statement and Assumptions Problem Statement We would like to examine the claim that the mean income of college educated people (16 years of education) is greater than the mean income of people with only a high school education (12 years of education) Assumptions The code used to produce everything in this section is shown below: Code 16.1. welch’s t test proc ttest data=edudata order=DATA sides=U; /*an Upper tailed test*/ class Educ; var Income2005; run; Normality Figure 3.1 shows histograms and Box plots relating to the data: Figure 16.1.1. Histograms and Box plots As we can see from the figure, the data is not normal, it is heavily right skewed in both cases. Both the histograms and the Box plots show this, as the histograms are way taller on the left side than on the right, while the box plots show that there is a bunch of data on the left with a ton of outliers, clearly not normal. We examine this further with the Q-Q plot in Figure 3.2 97 Analysis Guide Midterm Figure 16.1.2. Q-Q Plot The Q-Q plot conifrims our findings that the data is not very normal. However, the sample sizes are 400 and 1000, which means that we can definitely apply the central limit theorem. This means that we can treat the data as normal, we will assume normality. Independence We will assume independence in this case. 16.2 Complete Analysis Using SAS Statement of Hypotheses H0 :µ16yeareduc − µ12yeareduc ≤ 0 (16.2.1) H1 :µ16yeareduc − µ12yeareduc > 0 (16.2.2) Critical t Value With α = .05 and a one sided test, the critical t value (with the appropriate degrees of freedom) is calculated using the code shown below. data critval; p = quantile("T",.95,473.85); /*one sided test*/; proc print data=critval; run; The critical t value is shown in Figure 3.3: Figure 16.2.1. Critical t-value The critical t value is t = 1.64. This is illustrated using the following bit of SAS code: data pdf; do x = -4 to 4 by .01; pdf = pdf("T", x, 473.85); lower = 0; if x >= quantile("T",0.95,473.85) then upper = pdf;/*one sided*/ else upper = 0; output; end; run; title ’Shaded t distribution’; proc sgplot data=pdf noautolegend noborder; yaxis display=none; band x = x lower = lower upper = upper / fillattrs=(color=gray8a); series x = x y = pdf / lineattrs = (color = black); series x = x y = lower / lineattrs = (color = black); run; 98 Analysis Guide Midterm This produces Figure 3.4 Figure 16.2.2. 
Shaded t Distribution

Calculation of the t Statistic
To calculate Welch's t statistic, we use the code seen in Section 3.a.2, giving us a t value of t = 9.98, as seen in Figure 3.5
Figure 16.2.3. Results of Welch's t-test
We see that in this case, we have a t-value of 9.98.

Calculation of the p Value
We also see from Figure 3.5 that p < 0.0001.

Results of Hypothesis Test
We have that p < 0.0001 < α = .05, and therefore we reject the null hypothesis.

Conclusion
We have convincing evidence that the mean income of people with 16 years of education is greater than the mean income of people with 12 years of education. The one sided p-value is smaller than 0.0001. The figure below shows a one sided 95% confidence interval on our data:
Figure 16.2.4. Confidence Interval on the Difference of Means
The confidence interval on the difference of means is [27662.2, ∞). This estimates the plausible difference between the two population means: the mean income of the group with a 16-year education is at least about $27,662 greater than the mean income of the group with a 12-year education.

Scope of Inference
This was an observational study; therefore, we cannot conclude that the extra education caused the change (increase) in mean incomes. Households were selected from a random sample of a previously selected "area of the United States," and the subjects in this study are the members of those households. Therefore, since every member of the "area" had the same chance of being selected, it is a random sample of the "areas." However, no indication is given on how the "areas" were selected. In conclusion, the association between education and income above can be generalized to all the members of the "areas" that were selected for this study, but not generalized to the U.S. as a whole.

Verification using R
The following R code was used to verify the analysis:

eduData <- read.csv("Data/EducationData.csv", header = TRUE, sep = ",")
t.test(Income2005 ~ Educ, data = eduData,
       alternative = "less")  # "less" because R computes group 12 minus group 16

This gives the following output:

Welch Two Sample t-test

data:  Income2005 by Educ
t = -9.9827, df = 473.85, p-value < 2.2e-16
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
 -Inf -27662.19
sample estimates:
mean in group 12 mean in group 16
        36864.90         69996.97

Note that R is telling us that the mean income of the group with a 12-year education is at least about $27,662 less than that of the group with a 16-year education.

Preferences
I prefer the log-transformed analysis. Both analyses assume normality, but the log-transformed data is much closer to normal to begin with, and its variances are roughly equal. The log-scale analysis also speaks to the medians rather than the means, and the median is far more robust to the large number of outliers in this data. Because of those outliers, the mean is not a good measure of center here, so I prefer the log-transformed approach.
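Since the preference above is for the log-transformed comparison, here is a sketch of what that analysis might look like in R, reusing the eduData frame loaded in the verification step; the variable name logIncome and the pooled-variance choice are my own additions, not the earlier chapter's exact code:

# Log-scale comparison of the two education groups (a sketch).
eduData$logIncome <- log(eduData$Income2005)
t.test(logIncome ~ Educ, data = eduData,
       alternative = "less",  # R orders group 12 before group 16
       var.equal   = TRUE)    # variances are roughly equal on the log scale
# Exponentiating the estimated difference gives a multiplicative effect on
# median income, which is the natural interpretation after logging.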
Chapter 17
Problem 4: Trauma and Metabolic Expenditure rank sum

17.1 Hand-Written Calculations
To summarize, T = 82, µ(T) = 56, sd(T) = 8.632. The handwritten work was done before the author understood the continuity correction; the continuity-corrected Z and p values were calculated as follows:

Z = ((T − 0.5) − µ(T)) / SD(T) = 2.95    (17.1.1)
→ p = 0.001568    (17.1.2)

with a continuity correction of 0.5.

17.2 SAS verification
To verify the Z and p values calculated in Section 4.a, the following SAS code was run:

proc NPAR1WAY data=TraumaStudy Wilcoxon HL;
class PatientType;
Var MetabolicEx;
run;

The results of this code are shown in Figure 4.1
Figure 17.2.1. Continuity Corrected Wilcoxon Test Using SAS
The results of the two tests are the same! Note that if you add the option "correct=no" to the proc NPAR1WAY statement, you get the same values as the uncorrected ones in the handwritten work.

17.3 Full Statistical Analysis

Problem Statement
We would like to test the claim that the trauma patients had higher metabolic expenditures than the nontrauma patients.

Assumptions
The Wilcoxon Rank-Sum test only assumes that the observations are independent. We will assume independence here because the patients were not related to each other in any way; at the very least, one patient's metabolic expenditure does not depend on another patient's. We also note that the rank-sum statistic is approximately normally distributed, which justifies the normal approximation used below.

Hypothesis definitions
H0: meanRank_Trauma − meanRank_NonTrauma ≤ 0    (17.3.1)
H1: meanRank_Trauma − meanRank_NonTrauma > 0    (17.3.2)

In other words, the null hypothesis is that the nontrauma and trauma patients have equal distributions of metabolic expenditures, while the alternative hypothesis claims that the distribution of the trauma patients' metabolic expenditures is higher. We are using a one sided hypothesis test because that is what the book calls for. In this scenario, we will say α = 0.05.

Critical Value
The critical value was calculated using the following chunk of SAS code:

data critval;
p = quantile("Normal",.95); /*one sided test*/;
proc print data=critval;
run;

producing a critical z value of z = 1.64485.
Figure 17.3.1. Critical Value
The critical value is shown on a normal distribution using the following bit of SAS code:

data pdf;
do x = -4 to 4 by .01;
pdf = pdf("Normal", x);
lower = 0;
if x >= quantile("Normal",0.95) then upper = pdf;/*one sided*/
else upper = 0;
output;
end;
run;
title 'Shaded Normal distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x lower = lower upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;

The shaded distribution is displayed in Figure 4.3
Figure 17.3.2. Shaded Normal Distribution

Calculation of the z statistic
Our z statistic, calculated in Sections 4.a and 4.b, is 2.95.

Calculation of the p value
Our p-value, calculated in Sections 4.a and 4.b, is 0.0016.

Discussion of the hypothesis
We reject the null hypothesis, since p = .0016 < .05 = α.

Conclusion
We have convincing evidence that the distribution of metabolic expenditure of trauma patients is greater than that of the nontrauma patients (p = 0.0016 on a one sided Wilcoxon rank-sum test). The figure below shows a 95% Hodges-Lehmann confidence interval on the difference of the two distributions:
Figure 17.3.3. 95% Confidence Interval
This tells us that a plausible difference between the two distributions is between 1.9 and 16.7.
As we can see this does not include the null hypothesis which says their difference is less than or equal to zero. This cannot give us causal or population inferences because it was neither a randomized experiment nor a random sample ALSO MEDIANS DUH 106 Chapter 18 Problem 5: Autism and Yoga signed rank 18.1 Hand-Written Calculations The results of the calculations are as follows: S = 41, µS = 22.5, SDS = 8.4409, The Z value on the paper is incorrect, as it does not correct for continuity. So, here we will aplply the continuity correction: S − 0.5 − S̄ SDS (18.1.1) 40.5 − 22.5 = 2.13 → poneT ail = .0166ptwoT ail = .033 8.4409 (18.1.2) z= z= 107 Analysis Guide 18.2 Midterm Verification in SAS and R Verification in SAS To verify this, the following bit of SAS code was employed: Producing: Code 18.1. Signed Rank test in SAS data Autismdiff; set Autism; diff= Before-After; run; proc univariate data=Autismdiff; var diff; run; Figure 18.2.1. Signed Rank Test In SAS This two sided p value of 0.0313 is the same as a one sided p value of .01565, and a z value of 2.15. It is slightly different with my calculations and SAS’s because they didnt use a normal approximation, I did. Verification in R This R code was employed for the same purposes: AutismData <- read.csv("Data/Autism.csv",header=TRUE , sep=",") wilcox.test(AutismData\$Before , AutismData\$After , paired = TRUE , alternative = "greater", conf.int=TRUE) 1 2 3 4 5 Yielding: Wilcoxon signed rank test with continuity correction 1 2 data: AutismData\$Before and AutismData\$After V = 41, p-value = 0.01618 alternative hypothesis: true location shift is greater than 0 95 percent confidence interval: 4.999993 Inf sample estimates: (pseudo)median 17.49993 3 4 5 6 7 8 9 10 The R code applied a continuity correction, instead of doing the exact permutation like SAS. Their P value corresponds with a Z score of 2.139 18.3 6 step Sign Rank test using SAS Statement of Hypothesis H0 :MedianBef ore − MedianAf ter ≤ 0 (18.3.1) H1 :MedianBef ore − MedianAf ter > 0 (18.3.2) We will say that α = .05 and we are doing a one sided test Critical Values The critical value was calculated using the following chunk of SAS code: data critval; p = quantile("Normal",.95); /*one sided test*/; proc print data=critval; run; Producing a critical t value of t = 1.64485 109 Analysis Guide Midterm Figure 18.3.1. Critical Value The critical value is shown on a normal distribution using the following bit of SAS code data pdf; do x = -4 to 4 by .01; pdf = pdf("Normal", x); lower = 0; if x >= quantile("Normal",0.95) then upper = pdf;/*one sided*/ else upper = 0; output; end; run; title ’Shaded Normal distribution’; proc sgplot data=pdf noautolegend noborder; yaxis display=none; band x = x lower = lower upper = upper / fillattrs=(color=gray8a); series x = x y = pdf / lineattrs = (color = black); series x = x y = lower / lineattrs = (color = black); run; The shaded distribution is displayed in Figure 5.3 Figure 18.3.2. Shaded Normal Distribution Calculation of a Z statistic We will use the Z statistic calculated using R/by hand,Z = 2.13, however it will not have a huge effect on the outcome of the test Calculation of a p value For our z value, a one sided p value is p = 0.016. Assessment of hypothesis p = .016 < α = .05 →We reject the null hypothesis. Conclusion We have conclusive evidence that the median time to complete the puzzle for Autistic children is greater before 20 minutes of Yoga than after 20 minutes of Yoga. 
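As a quick cross-check of the normal-approximation arithmetic used above (S = 41 from n = 9 pairs, with a 0.5 continuity correction), a short R sketch:

# Normal approximation to the signed-rank statistic, with continuity correction.
n   <- 9
S   <- 41
muS <- n * (n + 1) / 4                        # 22.5
sdS <- sqrt(n * (n + 1) * (2 * n + 1) / 24)   # ~8.441
z   <- (S - 0.5 - muS) / sdS                  # ~2.13
p   <- pnorm(z, lower.tail = FALSE)           # one-sided p, ~0.017
c(z = z, p = p)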
We cannot infer causality because this was not a randomized experiment, and we cannot infer anything about the population because this was not a random sample. The median time for the children was at least 5 seconds longer before yoga than after yoga, as seen in the confidence interval displayed in the R output.

18.4 Paired t test in SAS

Statement of Hypothesis
H0: µ_before−after ≤ 0    (18.4.1)
H1: µ_before−after > 0    (18.4.2)
We will say that α = .05 and we are doing a one sided test.

Critical Values
The critical value was calculated using the following chunk of SAS code:

data critval;
p = quantile("T",.95,8); /*one sided test*/;
proc print data=critval;
run;

With the following output:
Figure 18.4.1. Critical Value
This gives a critical t value of t = 1.86, which is demonstrated in a shaded t distribution with the following chunk of code:

data pdf;
do x = -4 to 4 by .01;
pdf = pdf("T", x,8);
lower = 0;
if x >= quantile("T",0.95,8) then upper = pdf;/*one sided*/
else upper = 0;
output;
end;
run;
title 'Shaded t distribution';
proc sgplot data=pdf noautolegend noborder;
yaxis display=none;
band x = x lower = lower upper = upper / fillattrs=(color=gray8a);
series x = x y = pdf / lineattrs = (color = black);
series x = x y = lower / lineattrs = (color = black);
run;

The shaded distribution is displayed in Figure 5.5
Figure 18.4.2. Shaded T Distribution

Calculation of a t statistic
The t statistic was calculated using the following SAS code:

Code 18.2. Paired T test in SAS
proc ttest data=Autism alpha = .05 sides=U;
paired Before*After;
run;

The t value is shown in Figure 5.6
Figure 18.4.3. Paired t statistic
We have a t value of 2.54.

Calculation of a P value
The p value can be seen in Figure 5.6: p = .0173

Assessment of Hypothesis
p = .0173 < α = .05 → we reject the null hypothesis.

Conclusion
We have convincing evidence that the mean of the differences of times before and after the yoga is greater than zero (p = .0173 on a one sided paired t test). A confidence interval for the mean of the difference in time for the children to finish the puzzle before and after yoga is shown in Figure 5.7:
Figure 18.4.4. 95% Confidence interval
This means that the mean of the differences was at least 4.9 seconds. We cannot infer causality because this was not a randomized experiment, and we cannot make inferences about the population because this was not a random sample. We also note that a paired t test by itself cannot establish causal inferences.

18.5 Confirmation with R
The R code below was used to verify the results of the previous section:

t.test(AutismData$Before, AutismData$After,
       paired = TRUE,
       alternative = "greater", conf.int = TRUE)

The output is presented below:

Paired t-test

data:  AutismData$Before and AutismData$After
t = 2.5403, df = 8, p-value = 0.01735
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 4.913201 Inf
sample estimates:
mean of the differences
               18.33333

18.6 Complete Statistical Analysis
In this section, I will use a paired t-test, because the data is reasonably normal, as we will see in the following section. When both tests are possible, I believe the paired t test is preferable because it works directly on the raw differences, so the magnitudes of the differences are preserved and easy to interpret.

Assumptions
We can assume the differences are independent because the children did not affect one another. To check for normality we examine the following figure:
Figure 18.6.1.
Histogram and Box Plot As we see from Figure 5.8, the data is fairly normally distributed. The histogram is heavier in the center than on the edges, and the mean is near the median on the Box plot. We will examine this further in Figure 5.9 Figure 18.6.2. Q-Q Plot As we can see, the data follows the line of normality closely, and therefore we can assume normality. This means that a paired t test is appropriate. Statement of Hypothesis H0 :µbef ore−af ter ≤ 0 (18.6.1) H1 :µbef ore − af ter > 0 (18.6.2) We will say that α = .05 and we are doing a one sided test. Critical Values The critical value was calculated using the following chunk of SAS code: data critval; p = quantile("T",.95,8); /*one sided test*/; proc print data=critval; run; With the following output: Figure 18.6.3. Critical Value With a critical t value of t=1.86. This is demonstrated in a shaded t distribution with the following chunk of code: 113 Analysis Guide Midterm data pdf; do x = -4 to 4 by .01; pdf = pdf("T", x,8); lower = 0; if x >= quantile("T",0.95,8) then upper = pdf;/*one sided*/ else upper = 0; output; end; run; title ’Shaded Normal distribution’; proc sgplot data=pdf noautolegend noborder; yaxis display=none; band x = x lower = lower upper = upper / fillattrs=(color=gray8a); series x = x y = pdf / lineattrs = (color = black); series x = x y = lower / lineattrs = (color = black); run; The shaded distribution is displayed in Figure 5.11 Figure 18.6.4. Shaded T Distribution Calculation of a t statistic The T statistic was calculated using the following SAS code: proc ttest data=Autism alpha = .05 sides=U; paired Before*After; run; The t value is shown in Figure 5.12 Figure 18.6.5. Paired t statistic We have a t value of 2.54. Calculation of a P value The p value can be seen in Figure 5.6: p = .0173 Assessment of Hypothesis p = .0173 > α = .05 →we reject the null hypothesis. Conclusion We have conclusive evidence that the mean of the differences of times before and after the yoga is greater than zero (p=.0173 on a one sided paired t test). A confidence interval for the mean of the difference of time for the children to finish the puzzle before and after yoga is shown in Figure 5.13: Figure 18.6.6. 95% Confidence interval 114 Analysis Guide Midterm This means that the mean of the differences was at least 4.9 seconds. We cannot infer causality because this was not a randomized experiment, and we cannot make inferences about the population because this was not a random sample. We also cannot make causal inferences with a paired t test 115 Chapter 19 sexy ranked permutation test Here is the SAS code I designed to conduct a Ranked permutation test I did not have time to add a normal Code 19.1. 
handcrafted rank sum test proc import datafile=’c:\Users\david\Desktop\MSDS\MSDS6371\Homework\Week4\Data\Trauma.csv’ out=TraumaStudy DBMS=CSV; run; proc rank data=TraumaStudy out=Ranked ties=mean; var MetabolicEx; ranks rank; run; proc print data=Ranked; run; proc iml; use Ranked var {PatientType rank}; /*making two groups in IML*/ read all var {rank} where(PatientType=’Nontrauma’) into g2; read all var {rank} where(PatientType=’Trauma’) into g1; obsdiff = sum(g1) - sum(g2); print obsdiff; call randseed(12345); /* set random number seed alldata = g1 // g2; /* stack data in a single vector N1 = nrow(g1); N = N1 + nrow(g2); NRepl = 5000; /* number of permutations nulldist = j(NRepl,1); /* allocate vector to hold results do k = 1 to NRepl; x = sample(alldata, N, "WOR"); /* permute the data */ nulldist[k] = sum(x[1:N1]) - sum(x[(N1+1):N]); /* difference of sums */ end; */ */ */ */ title "Histogram of Null Distribution"; refline = "refline " + char(obsdiff) + " / axis=x lineattrs=(color=red);"; call Histogram(nulldist) other=refline ; pval = (1 + sum((nulldist) >= (obsdiff))) / (NRepl+1); /*this means one sided test print pval; quit; curve to my figure, however, the p value is more or less the same as the wilcoxon test however it is a more reasonable number. 116 Analysis Guide Midterm Figure 19.0.1. Permutation Test 117 Chapter 20 Unit 4 lecture slides Here it is 118 10/13/2018 Let’s Start With an Example IBM gives each employee in the marketing department technical training Based on further testing, it appears the traditional training method isn’t effective Hence, a new training method is developed Below are the test scores of 4 individuals who just finished the “New Method” and the last 3 test scores from employees trained via the “Traditional Method” course • Is there evidence to suggest that the “New Method” increases test scores? • • • • Alternatives to (Student) t-Tools New Method 37 49 55 77 RANK SUM TE ST W E LCH’S TE ST SIGN TE ST / SIG NE D RANK TE ST Traditional Method 23 31 46 2 Examining the t-Tools Assumptions 2 2 1 1 Which situation does it appear we are in? 1 Since the standard deviations appear (visual check) to be different and the sample sizes are both different and exceptionally small, the t-test was not deemed appropriate and the nonparametric rank sum test was performed. Using a t-test could have low power. 4 Nonparametric Methods • A NONPARAMETRIC or DISTRIBUTION-FREE test doesn’t depend on underlying assumptions Nonparametric Methods: The Rank Sum Test • This makes them ideal for use when the assumptions of non-nonparametric (that is, PARAMETRIC) tests aren’t met • The trade-off is that nonparametric methods perform somewhat worse than parametric methods if the assumptions are approximately correct • The first nonparametric method we will consider is the ”rank sum test” 5 1 10/13/2018 Rank Sum Test: Advantages The Hypothesis Test • No distributional assumptions • Resistant to outliers •Performs nearly as well as the t-test when the two populations are normal and considerably better when there are extreme outliers •Works well with ORDINAL (as opposed to interval data) •Works with censored values •It still requires some assumptions: 1. 2. All observations are independent The Y values are ordinal (TWO SIDED) 59 patients with arthritis who participated in a clinical trial were assigned to two groups, active and placebo. The response status: (excellent=5, good=4, moderate=3, fair=2, poor=1) of each patient was recorded. 
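Before the formal steps on the next slide, the ranking idea can be previewed in R with the IBM training scores introduced earlier (New Method: 37, 49, 55, 77; Traditional Method: 23, 31, 46):

# Ranking step for the rank-sum test: pool both groups, rank, sum one group.
scores <- c(37, 49, 55, 77,   # New Method
            23, 31, 46)       # Traditional Method
group  <- c(rep("New", 4), rep("Trad", 3))
r <- rank(scores)             # ties (none here) would get averaged ranks
tapply(r, group, sum)         # rank sums: New = 21, Trad = 7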
The Rank Sum test (ONE SIDED) The Sampling Distribution of … • We can compute the rank sum test statistic using the following steps: 1. 2. 3. 4. List all observations from both groups in increasing order Note: n is the total # of observations Assign each observation a rank, from 1 to n If there are any ties, assign each tied observation’s rank to be the average of their ranks. Identify each observation by its group The Rank Sum Statistic! Rank Sum test statistic (sum of ranks of one group) is approximately normally distributed! • The test statistic, T, is the sum of the ranks in one of the groups. •We can find a p-value in two ways: • Normal approximation • Re-randomization (exact or approximate) Rank-Sum Test: Normal Approximation Rank Sum Test: randomly assign ranks Name Order # Bob 1 Sue 2 Fred 3 Jim 4 Pam 5 Tim 6 Zac 7 Group Rank New 5 New 7 New 2 New 1 Trad 3 Trad 4 Trad 6 Name Order # Sue 1 Bob 2 Fred 3 Jim 4 Pam 5 Tim 6 Zac 7 Group Rank New 7 New 5 New 2 New 1 Trad 3 Trad 4 Trad 6 … Name Order # Pam 1 Tim 2 Sue 3 Zac 4 Fred 5 Bob 6 Jim 7 Group Rank New 3 New 4 New 7 New 6 Trad 2 Trad 5 Trad 1 Record sum of ranks of one group (e.g. “Trad.”) for all 7! permutations of ranks. (7!=7*6*5*4*3*2*1=5040) P-value is the number of permutations with a sum equal to or more extreme than the one in the original data set divided by the total number of permutations. *Could also do an approximate p-value by randomly choosing, say, 1000 orderings of the data. 2 10/13/2018 Rank-Sum Test: Normal Approximation Rank-Sum Test: Normal Approximation Common interpretation: H0: The distribution of New Method Scores = The distribution of the Traditional Method Scores H1:The distribution of New Method Scores > The distribution of the Traditional Method Scores Common interpretation: H0: The distribution of New Method Scores = The distribution of the Traditional Method Scores H1:The distribution of New Method Scores > The distribution of the Traditional Method Scores Technical mathematical interpretation: H0: Average rank of New Method Scores = Average rank of all Scores (constant) H1: Average rank of New Method Scores > Average rank of all Scores (constant) There is mild evidence (alpha = 0.1) to suggest that the distribution of scores from the “New” method is greater than the distribution of the “Traditional” method (normal approximation to rank-sum test p-value = 0.0558). There is mild evidence (alpha = 0.1) to suggest that the distribution of scores from the “New” method is greater than the distribution of the “Traditional” method (normal approximation to rank-sum test p-value = 0.0558). Permutation Test (Exact P-value) Rank Sum Test (Wilcoxon) H0: H1: The distribution of New Method Scores = The distribution of the Traditional Method Scores The distribution of New Method Scores > The distribution of the Traditional Method Scores Normal approximation p-values There is sufficient evidence at the alpha = 0.1 level of significance (p-value = .0571 for the exact test) to suggest that the distribution of scores from four IBM employees that were given the New Method is greater than the distribution of the 3 employees that took the test having had the Traditional Method of instruction. 
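The exact and normal-approximation p-values quoted for the IBM example can be reproduced in R. Note that wilcox.test reports the Mann-Whitney form of the statistic rather than the rank sum itself, but the p-values correspond to the calculation on the slides:

# Exact one-sided rank-sum (Mann-Whitney) test for the IBM example.
new  <- c(37, 49, 55, 77)
trad <- c(23, 31, 46)
wilcox.test(new, trad, alternative = "greater", exact = TRUE)
# exact p-value = 2/35 ~ 0.0571, matching the slide; setting exact = FALSE
# gives the continuity-corrected normal approximation (~0.0558) instead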
Exact p-values Cognitive Load Experiment Cognitive Load Experiment • Researchers compared the effectiveness of conventional textbook examples to modified ones • They selected 28 ninth-year students who had no previous exposure to coordinate geometry • The students were randomly assigned to one of two self study instructional groups, using conventional and modified instructional materials • After instruction, they were given a test and the time to complete one of the problems was recorded. Is there sufficient evidence to suggest that the cognitive load theory (modified instruction) shortened response times? (CENSORED DATA) 3 10/13/2018 Cognitive Load Experiment Cognitive Load Experiment: Normal Approximation With ties, the ranks are averaged. (CONTINUITY CORRECTION) Statistical Conclusion: The data provide convincing evidence that a student could solve the problem more quickly after the “modified” rather than the the “conventional” method (one-sided, normal approximation w/ C.C. p-value = 0.0013, from the rank-sum test). Cognitive Load Experiment: Using SAS Confidence Interval for the Location Parameter (Median): Hodges Lehman Confidence Interval https://en.wikipedia.org/wiki/Hodges%E2%80%93Lehmann_estimator *We will look at an example later Cognitive Load Experiment Cognitive Load Experiment (All Together) Ho: Distribution of Modified and Conventional Scores are equal Ha: Distribution of Modified Scores is less than that of Conventional Critical Value (left sided): -1.645 (alpha = .05) Test Statistic: z-stat = -3.0183 P-value (left sided)= .0013 Reject Ho Statistical Conclusion (continued): A range of plausible values for how much smaller the “modified” distribution is than the “traditional” (treatment effect) is [-158, -59] s. (95% confidence interval based on a rank-sum test) with a pointestimate of 108.5 s. Statistical Conclusion (continued): The data provide convincing evidence that a student could solve the problem more quickly after the “modified” rather than the “conventional” method (one-sided, normal approximation w/ C.C. p-value = 0.0013, from the rank-sum test). A range of plausible values for how much smaller the “modified” distribution is than the “traditional” (treatment effect) is [-158, -59] sec. (95% confidence interval based on a rank-sum test) with a point-estimate of 108.5 sec. 4 10/13/2018 Creativity Study: Reminder I E Welch’s t-Test What if this assumption isn’t true? 25 Welch’s t-Test Testing Hypothesis: Welch’s t-Tools 28 Gender Income Discrimination Gender Income Discrimination Strong evidence against normality, but CTL applies. Strong evidence against equal standard deviations and different sample sizes. (They are close but the standard deviations appear to be so different that this may make a real difference.) We will assume independence. Student’s t-test not a good idea here. 5 10/13/2018 Rank Sum versus Welch’s … the Take Away Gender Income Discrimination! If you wish to make inference on the difference of means and you have the sample size to invoke the CLT, Welch’s t-test is preferred by most statisticians, and it is robust to different standard deviations even when the sample size is not equal. Often, especially in skewed distributions, the median is a better measure of center. For this reason, one may prefer the rank sum test even when Welch’s t-test is available. 
Test Statistic: tstat = -3.88 P-value = .0006 Reject H0 If you have small sample sizes, you may not be very confident about the normality assumption even if the histograms and q-q plots look okay. For this reason, one may wish to be “conservative” and run the rank sum test and obtain inference on the median. Conclusion: There is strong evidence to suggest that the mean income of the female group is different from the mean income of the male group (p-value = .0006). A 95% confidence interval for this difference is ($29,124, $94,176) in favor of the males. If there are outliers or censored values, the rank sum test is often the most appropriate as the t-test is not resistant to outliers and has no way of using censored data. That is quite a difference! Performance of Welch’s t-test Paired T-Test Paired T-Test A Look at the Variance Known alternatively as Matched Pairs or Dependent t-Test Assumptions • Data are either: • From one sample that has been tested twice (example pre- and post-test or repeated measures) • From a group of subjects that are thought to be similar and can thus be matched or paired (example from same family, or twins) • Differences are normally distributed, independent between observations (but dependent from one group to the next). •If data can be paired, the variance can be reduced. 35 36 6 10/13/2018 Example: Medical Reasoning Test Example: Keith’s Medical Reasoning Test • The AMA has a diagnostic test for medical reasoning • On average, people score about 500 points on this test • We have data from 10 subjects who took the medical reasoning test. These subjects were randomly selected from St. Paul Hospital in Dallas •Not fatigued: is the baseline, taking the test before a shift •Fatigued: is after the treatment; working for 12 operational hours prior to re-taking the test. Subject # Not Fatigued Fatigued 1 567 530 2 512 492 3 509 510 4 593 580 5 588 600 6 491 483 7 520 512 8 588 575 9 529 530 10 508 490 We can try to test whether the DIFFERENCE OF THE MEANS between the fatigued scores and the not fatigued scores is less than zero. (Lower numbers = worse score) 37 Example: Medical Reasoning Test 38 If we did this, we would be wrong! Why? A fundamental assumption is violated: independence Assumption Check Failure We need to account for the dependence between the two groups 40 39 Example: Keith’s Medical Reasoning Test Paired t-test reduces to a one-sample t-test Instead of testing the DIFFERENCE OF THE MEANS: We should test the MEAN OF THE DIFFERENCES: Subject 1 2 3 4 5 6 7 8 9 10 Fatigued Not Fatigued Difference 530 567 -37 492 512 -20 510 509 1 580 593 -13 600 588 12 483 491 -8 512 520 -8 575 588 -13 530 529 1 490 508 -18 41 Subject 1 2 3 4 5 6 7 8 9 10 Fatigued 530 492 510 580 600 483 512 575 530 490 (di) Not Fatigued Difference 567 -37 512 -20 509 1 593 -13 588 12 491 -8 520 -8 588 -13 529 1 508 -18 H0: d = 0 Ha: d < 0 42 7 10/13/2018 A SAS Code Comparison Two (independent) sample T-Test A SAS Code Comparison Using paired data (when appropriate) instead of unpaired data allows us to tighten the confidence interval for the difference in means (yeah!) AND increase the power (the likelihood that our data properly detects a shift in score). Paired T-test Paired T-test Two (independent) sample T-Test 43 Checking the Assumptions 44 Additional Information • We can look at a PROFILE PLOT • The lines connect the scores on the MRT in the “fatigued” versus “not fatigued” states • This plot is standard for SAS proc ttest with paired data. 
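The paired analysis sketched on these slides can be reproduced in R from the MRT scores in the table above; a minimal sketch, with the values transcribed from the slide:

# Paired t-test on the medical reasoning test data (fatigued vs. not fatigued).
not_fatigued <- c(567, 512, 509, 593, 588, 491, 520, 588, 529, 508)
fatigued     <- c(530, 492, 510, 580, 600, 483, 512, 575, 530, 490)
t.test(fatigued, not_fatigued, paired = TRUE,
       alternative = "less", conf.level = 0.99)
# equivalent to a one-sample test on the differences:
# t.test(fatigued - not_fatigued, alternative = "less", conf.level = 0.99)
# t ~ -2.41 and one-sided p ~ 0.0196, matching the conclusion slide that follows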
There is little to no evidence that the differences do not come from a normal distribution. We will assume that the differences are independent. Is this a reasonable assumption? 45 46 Appendix Conclusion (alpha = 0.01) Critical Value: t0.01,9 = -2.821 Test Statistic: tstat= -2.41 P-value = 0.0196 > 0.01 Fail to Reject Ho Statistical Conclusion: There is not enough evidence to suggest that, on average, the fatigued subjects score lower than the non-fatigued subjects (p-value = .0196). A 99% one sided confidence interval for the mean difference in scores is (-infinity, 1.76). Perhaps, a more meaningful confidence interval would be a two-sided 98% confidence interval of (-22.36, 1.76). Scope of Inference: Since this was a random sample from St. Paul Hospital in Dallas, we can infer that this result would be repeated for any group selected from this hospital. There is no way to guarantee a causal inference from a paired t-test. Note: The elusiveness of the causal inference comes from the fact that the treatment that induces fatigue may itself be a confounder. Some may work for 12 hours as a surgeon and others may work 12 hours writing reports. There is reason to believe that if a difference is detected, this difference may not be due to fatigue rather may be due to the type of work. 47 8 10/13/2018 Example: Nerve Data horse 6 4 8 5 7 9 For each of the 9 horses, a veterinary anatomist measured the density 3 1 of nerve cells at specified sites in the intestine. 2 Alternatives to the t-Test for Paired Data Using the paired t-Test site1 site2 14.2 16.4 17 19 37.4 37.6 11.2 6.6 24.2 14.4 35.2 24.4 35.2 23.2 50.6 38 39.2 18.6 The Hypothesis Test (TWO SIDED) (ONE SIDED) The sample size is rather small, hence the normality assumption is somewhat suspect. 52 Sign Test: Horse Data (ONE SIDED, CC P-VALUE) horse 8 4 6 5 7 9 3 1 2 Test and Conclusion site1 37.4 17 14.2 11.2 24.2 35.2 35.2 50.6 39.2 site2 37.6 19 16.4 6.6 14.4 24.4 23.2 38 18.6 diff -0.2 -2 -2.2 4.6 9.8 10.8 12 12.6 20.6 Sign + + + + + + Critical Value (right sided): z0.05=1.645 P-value (one sided) = .2527 t statistic: tstat = 0.666 Fail to Reject H0. Statistical Conclusion: There is not enough evidence that the median nerve density at site 1 is greater than the median nerve density at site 2 (Wilcoxon sign test one-sided p-value of 0.2527). K=6 54 9 10/13/2018 Signed Rank Test: Horse Data (ONE SIDED, CC P-VALUE) horse 8 4 6 5 7 9 3 1 2 site1 37.4 17 14.2 11.2 24.2 35.2 35.2 50.6 39.2 site2 37.6 19 16.4 6.6 14.4 24.4 23.2 38 18.6 Test, Conclusion and Some Notes abs(diff) 0.2 2 2.2 4.6 9.8 10.8 12 12.6 20.6 Sign + + + + + + rank 1 2 3 4 5 6 7 8 9 Critical Value (right sided): z0.05=1.645 P-value (one sided) = .0294 t statistic: tstat = 1.89 Reject Ho. Statistical Conclusion: There is strong evidence that the median nerve density at site 1 is greater than the median nerve density at site 2 (Wilcoxon signed rank test one-sided p-value of 0.0294). Note: S = 39 • The signed-rank test has more power than the sign test (Compare the p-values 0.254 vs. 0.0294) • Both tests make very few assumptions about the distributions 56 Horse Data Note: These are two sided…. Half of this is close to our calculated one sided p-values from earlier. Note: For n < 20 SAS uses the probabilities from the binomial distribution rather than the normal approximation. These are more accurate (exact) and we should use these when SAS is available. 10 Part V ANOVA 129 Chapter 21 Problem 1: Plots and Logged Data We begin our work looking at raw and transformed data. 
21.1 Plots and Transformations Raw Data Analysis First, we will look at the raw data. To check if the raw data fits the assumptions, we will first look at a scatter plot. The scatter plot of the raw data was produced by the following bit of SAS code: Code 21.1. Scatterplot of Raw Data Using SAS proc sgplot data=EduData; scatter x=educ y=Income2005; run; This results in the following plot21.1: Figure 21.1.1. Scatter Plot of the Raw Data Looking at Figure 21.1.1, we see that the raw data is very heavy in between 0 and 20,000 for all categories, but some groups spread further and wider than others, which suggests the variances may not be equal. The heaviness of the lower end of each group may also suggest a lack of normality. We will examine this further with some Box plots. These were produced using the following chunk of SAS code: This results in the following plot: 130 Analysis Guide Midterm Code 21.2. Boxplot of Raw Data Using SAS proc sgplot data=EduData; vbox Income2005 / category=educ dataskin=matte ; xaxis display=(noline noticks); yaxis display=(noline noticks) grid; run; Figure 21.1.2. Box Plot of the Raw Data Figure 21.1.2 tells us a lot about our data. We see from the size and shape of the boxes that the variances of our data are by no means homogeneous. Note that there are a lot of outliers while the distribution is heavily weighted towards the bottom, this suggests our data may have departed from normality. We will examine this phenomenaa further using histograms. To produce histograms of the raw data, the following SAS code was used: This results in the following Code 21.3. Histogram of Raw Data Using SAS proc sgpanel data=EduData; panelby educ / rows=5 layout=rowlattice; histogram Income2005; run; plot: 131 Analysis Guide Midterm Figure 21.1.3. Histogram of the Raw Data Figure 21.1.3 confirms our suspicions, the variances of the data are likely unequal, but more importantly, the data is clearly skewed to the right. We will confirm this using Q-Q plots. To produce Q-Q plots of the raw data, the following SAS code was used: Code 21.4. Q-Q of Raw Data Using SAS /* Normal = blom produces normal quantiles from the data */ /* To find out more, look at the SAS documentation!*/ proc rank data=EduData normal=blom out=EduQuant; var Income2005; /* Here we produce the normal quantiles!*/ ranks Edu_Quant; run; proc sgpanel data=EduQuant; panelby educ; scatter x=Edu_Quant y=Income2005 ; colaxis label="Normal Quantiles"; run; This results in the following plot: 132 Analysis Guide Midterm Figure 21.1.4. Q-Q Plot of the Raw Data The Q-Q plots in Figure 21.1.4 tell us what we already know: The raw data is not normal, and does not have equal variances. The ANOVA test is not super robust to highly skewed, long tailed data, and it relies entirely on equal variances, so we absolutely cannot use the raw data 133 Analysis Guide Midterm Transformed Data Analysis Now we will perform a log transformation on the data and see if that helps it meet our assumptions better. To do a log transformation, we will employ the following SAS code: We will begin our analysis of the Code 21.5. Logging of Raw Data Using SAS data LogEduData; set EduData; LogIncome=log(Income2005); run; transformed data with a scatter plot, produced with the following SAS code: This results in the following Code 21.6. Scatterplot of Logged Data Using SAS proc sgplot data=LogEduData; scatter x=educ y=LogIncome; run; plot: Figure 21.1.5. 
Scatter Plot of the Log-Transformed Data As we can see in Figure 21.1.5, the groups have a much more similar size, suggesting similar variances, and the heavy part of the scatter plot is closer to the center, in between the outliers, which tells us the log transformation may have done a good deal towards normalizing our data. We can examine this further using Box plots. To produce Box plots of the transformed data, the following SAS code was used: This gives us the Code 21.7. Boxplot of Logged Data Using SAS proc sgplot data=LogEduData; vbox LogIncome / category=educ dataskin=matte ; xaxis display=(noline noticks); yaxis display=(noline noticks ) grid; run; following plot: 134 Analysis Guide Midterm Figure 21.1.6. Box Plot of the Log-Transformed Data Figure 21.1.6 gives us some useful information about our data. We see the boxes and whiskers are of similar size, which tells us the variances are likely homogeneous. Furthermore, the medians and means are near each other, and the boxes are near the center of the distribution, which suggests that the data may be normal. We will examine these two phenomena further with histograms. To produce histograms of the log-transformed data, the following SAS code was used: This results in the following plot: Code 21.8. Histogram of Logged Data Using SAS proc sgpanel data=LogEduData; panelby educ / rows=5 layout=rowlattice; histogram LogIncome; run; Figure 21.1.7. Histogram of the Log-Transformed Data 135 Analysis Guide Midterm From the spread of the histograms in Figure 21.1.7, we see two things. First, the similar width of the histograms confirms that variances are roughly equal. Second, the shape of the histograms, and their location near the center suggests that the data is very nearly normal. We will further examine the normality of the data using Q-Q plots. To produce the Q-Q plots of the transformed data, the following SAS code was used: This results in the Code 21.9. Q-Q of Logged Data Using SAS proc rank data=LogEduData normal=blom out= LogEduQuant; var LogIncome; ranks LogEduQuant; run; proc sgpanel data=LogEduQuant; panelby educ; scatter x=LogEduQuant y=LogIncome ; colaxis label="Normal Quantiles"; run; following plot: Figure 21.1.8. Q-Q Plot of the Log-Transformed Data Examining Figure 21.1.8, we see a confirmation of our beliefs: The log-transformed data, when plotted against normal quantiles, is fairly normal. This means, with the log transformed data, we can reasonably assume normality and homogeneity of variances. 21.2 Complete Analysis We will now perform a complete analysis of our data, using Pure ANOVA. Problem Statement We would like to determine whether or not at least one of the five population distributions (corresponding to different years of education) is different from the rest. Assumptions As seen in Section 21.1, the raw data does not meet the assumption of normality nor of homogeneity of variance. However, in Section 21.1, we proved that after a log transformation, the data does meet both of these assumptions. The ANOVA test is fairly robust to the slight departure from normality presented by the log transformed data, and the variances are equal. The data is clearly independent, so that assumption is met. Therefore, all assumptions of ANOVA are met by the log transformed data. Hypothesis Definition In this problem, our Null (Reduced Model) Hypothesis, H0 , is that all the groups have the same distribution and our Alternative (Full Model) Hypothesis, H1 is that the distributions are different. 
Mathemati136 Analysis Guide Midterm cally, that is written as: H0 :mediangrand H1 :median<12 mediangrand median12 mediangrand median13−15 mediangrand median16 mediangrand median>16 (21.2.1) (21.2.2) We will consider our confidence level, α to be 0.05 F Statistic To conduct this hypothesis test, the following SAS code was used: This results in the following ANOVA Code 21.10. ANOVA Test Using SAS proc glm data = LogEduData; class educ; model LogIncome = educ; run; Output: Figure 21.2.1. ANOVA Table Figure 21.2.1 tells us what our F statistic is. We see that F = 62.87 (21.2.3) P-value Figure 21.2.1 also tells us our p-value. In this case, (21.2.4) p < .0001 Hypothesis Assessment In this scenario, we have that p < .0001 < α = .05 and therefore we reject the null hypothesis. Conclusion There is substantial evidence (p < 0.0001) that at least one of the distributions is different from the others. To further examine this, we will see if the distribution varies within similar levels of schooling. We will compare <12 and 12 years of school, 12 and 13-15 years of school, 13-15 and 16 years of school, and 16 and >16 years of school. To do this, we will compare medians, using the following SAS code: This results Code 21.11. Comparison of distributions using SAS proc sort data=LogEduData; by educ; run; proc means data = LogEduData by educ; var LogIncome; run; in the following Table: 137 median order=data; Analysis Guide Midterm Table 21.1. Comparison of Logged Means Education µ <12 12 13-15 16 >16 9.9 10.22 10.39 10.79 10.89 From Table 21.1, we can calculate the differences of the means for our log transformed groups, and see how much the distributions differ, shown in the following table: Table 21.2. Comparison of Distributions Pair Difference Multiplicative Effect (eµ1 −µ2 ) % Increase <12 and 12 12 and 13-15 13-15 and 16 16 and >16 0.32 0.17 .4 .1 1.38 1.19 1.49 1.11 38 19 49 11 Table 21.2 shows us how many times greater the distribution of the income of the larger education in each pair is than the lower education level. Scope of Inference As this was a random sample, we can make inferences about the population, however, we cannot make causal inferences, as this was not a randomized experiment. That means, we can say that in general, people with X years of education make Y many times as people with Z years of education, but we cannot say it is due to the education itself. 21.3 Extra Values The extra values were produced with the same code as in Section 28.1. They can be found in Figure 21.2.1, and in the figure below: Figure 21.3.1. Extra Values Value of R2 Figure 21.3.1 tells us R2 is 0.0888 Mean Square Error and Degrees of Freedom The Mean Square Error, shown in Figure 21.2.1, is 2232.12, with 2579 degrees of freedom ANOVA in R! Here is the R code and output to do ANOVA in R on the log transformed data: 138 Analysis Guide Midterm Code 21.12. ANOVA in R 1 2 3 # #################### Anova in R ###################### edudata <- read.csv(file=’data/ex0525.csv’, header=TRUE , sep = ",") edudata$logincome <- log(edudata$Income2005) 4 5 6 7 # http://www.sthda.com/english/wiki/one -way -anova -test -in -r anovatest <- aov(logincome~Educ ,data =edudata) summary(anovatest) 8 9 # ######################## Results ##################### 10 11 12 13 Df Sum Sq Mean Sq F value Pr(>F) Educ 4 217.7 54.41 Residuals 2579 2232.1 0.87 62.87 <2e-16 *** 139 Chapter 22 Problem 2: Build Your Own Anova! 
In this section we will be building an ANOVA table to determine whether or not the distribution of income of people with > 16 years is different than the distribution of income of people with exactly 16 years of education. To build this ANOVA table, we need two preliminary ANOVA analyses. First, is the ANOVA analysis seen in Section 21.2. This has the null hypothesis that all the distributions are the same, and the alternative hypothesis that the distributions differ. Next, we build a second ANOVA table, which will have a null hypothesis that all the distributions are the same, and an alternative hypothesis that all the distributions are different, except the group with 16 years and the group with >16 years are still the same. This is done by grouping the two into one group, with the following SAS code: Next, to compute important Code 22.1. Regrouping data using SAS data EduGroupData; set LogEduData; Others = educ; if educ eq "16" educ = ">16" then Others="a";run; parameters, an ANOVA test is conducted on the grouped, logged, data, with the following bit of code: This Code 22.2. Secondary ANOVA using SAS proc glm data = EduGroupData; class Others; model LogIncome = Others; run; results in the following intermediate ANOVA table: Figure 22.0.1. Grouped ANOVA Table 22.1 Building the Extra Sum of Squares Anova Table Using the data from 22.0.1 and the data from 21.2.1, we can make our own ANOVA table, which has a null hypothesis that all the distributions different and (except 16 and >16, which are the same), and an alternative hypothesis that all the distributions are different. Since both hypotheses have the same prediction about the data for <12, 12, and 13-15, the null hypothesis of our custom-made ANOVA table is that 16 and >16 have the same distribution, and the alternative is that they have different distributions. We will now construct our new, extra sum of squares ANOVA table. First, for our full model (the "Error" row in the ANOVA table), we will use the full model (alternative hypothesis, or the "Error" row), from Figure 21.2.1. This represents our alternative hypothesis, where the distribution of 16 and >16 are different. Next, we will construct our reduced model (The "Total" row in the ANOVA table) using the full model (alternative hypothesis, or the "Error") from 22.0.1. This represents our null hypothesis, where 16 and >16 have the same distribution. To generate our Model, or Extra Sum of Squares, which will allow us to find our F statistic and p value, we need to take a couple of steps. To determine the number of degrees of freedom of our model, we subtract the number of degrees of freedom from the Error row from the number of degrees of freedom of the Total row. To calculate the extra sum of squares, we subtract the residual sum of squares of the full model (error) from the residual sum of squares of the reduced model (total). Then, to find the mean square, we divide the extra sum of squares by the number of degrees of freedom in our model. Our F statistic is then produced by normalizing the Extra Sum of Squares, dividing it by the Mean Square Error (in the Error row). To get a p value from the F statistic, 140 Analysis Guide Midterm we examine an F distribution with degrees of freedom = displayed in the following table: dfmodel dff ull . The results of these computations are Table 22.1. 
22.2 Complete Analysis

Problem Statement
We would like to determine whether or not people with a college degree and people with a graduate degree have different distributions of income.

Assumptions
There are three assumptions of ANOVA: normality, homogeneity of variance, and independence. We showed in Section 21.1 that while the raw data does not meet the first two assumptions, the log-transformed data does. Both the transformed and raw data meet the assumption of independence. We will proceed with our ANOVA test.

Hypothesis Definition
Our null hypothesis states that the distributions of the >16 and 16 groups are the same, and our alternative hypothesis states that the distributions of the >16 and 16 groups are different. As shown in Section 22.1, this is written mathematically as:

H0: median_<12, median_12, median_13-15, median_16,>16 (the 16 and >16 groups share a common median) (22.2.1)
H1: median_<12, median_12, median_13-15, median_16, median_>16 (each group has its own median) (22.2.2)

OR, equivalently:

H0: median_16 = median_>16 (22.2.3)
H1: median_16 ≠ median_>16 (22.2.4)

We will use a significance level of α = 0.05.

F Statistic
The F statistic is calculated with the following equation:

F = (SS_extra / DF_extra) / σ̂²_full = (SS_extra / DF_extra) / MSE (22.2.5)

The results of this calculation can be seen in Table 22.1: we have F = 2.3. This is a small F statistic, which is likely indicative of weak evidence against the null hypothesis.

P-value
The p-value is calculated using F, the Extra (Model) degrees of freedom, and the Full (Error) degrees of freedom. Using the values calculated in Table 22.1, we have p = 0.129.

Hypothesis Assessment
At a significance level of α = 0.05, we have p = 0.129 > α = 0.05. Therefore, we fail to reject the null hypothesis.

Conclusion
There is not enough evidence to suggest that the distribution of income of people with a college education only (16 years) is different from the distribution of income of people with a postgraduate education (>16 years).

Scope of Inference
It is not strictly necessary to write a scope of inference, as we did not reject the null hypothesis; however, this was a random sample, so we can make inferences about the population as a whole, but we cannot infer causality, as this was not a randomized experiment.

22.3 Degrees of Freedom and Comparison to T-Test
This test had 2579 error degrees of freedom (as seen in Table 22.1). That is far more than a two-sample t test of the 16 and >16 groups would have, because the t test estimates the pooled standard deviation from only those two groups, while the extra-sum-of-squares F test uses all five groups. With more degrees of freedom behind the pooled standard deviation, this ANOVA-based test has more power than the t test.

Chapter 23
Problem 3: Nonhomogeneous Standard Deviations

23.1 Complete Analysis

Problem Statement
We would like to determine whether or not at least one of the five population distributions (corresponding to different years of education) is different from the rest.

Assumptions
As seen in Section 21.1, the raw data meets neither the assumption of normality nor that of homogeneity of variance. However, in Section 21.1 we showed that after a log transformation the data is at least approximately normal. The ANOVA test is fairly robust to the slight departure from normality presented by the log-transformed data, so we can safely assume normality. However, we cannot assume homogeneity of variances, so pure ANOVA is not appropriate. Since the data is approximately normal, we should try to use a parametric test, as parametric tests generally have more power than their nonparametric analogs.
Therefore, the Kruskal-Wallis test is not the most appropriate test. We will instad use Welch’s ANOVA Test, which assumes normality but does not assume homogeneity of variance, on the log transformed data. We can assume the data is independent. Hypothesis Definition In this problem, our Null (Reduced Model) Hypothesis, H0 , is that all the groups have the same distribution and our Alternative (Full Model) Hypothesis, H1 is that the distributions are different. Mathematically, that is written as: H0 :mediangrand H1 :median<12 mediangrand median12 mediangrand median13−15 mediangrand median16 mediangrand median>16 (23.1.1) (23.1.2) We will consider our confidence level, α to be 0.05 F Statistic To conduct this hypothesis test, the following SAS code was used: This results in the following table: Code 23.1. Welch’s ANOVA in SAS proc glm data = LogEduData; class educ; model LogIncome = educ; means educ / welch; run; Figure 23.1.1. Welch’s ANOVA Table From Figure 23.1.1, we have that F = 56.59. This is a pretty large F statistic, which means that we probably have some good evidence in favor of the alternative hypothesis. 143 Analysis Guide Midterm P-value Figure 23.1.1 Also tells us that the p-value associated with the F statistic, which is given as p < 0.0001. Hypothesis Assessment We have that p < 0.0001 < α = .05 and therefore we Reject the null hypothesis Conclusion There is convincing evidence (p < 0.0001) that at least one of the distributions is different from the others. Scope of Inference As this was a random sample, we can make inferences about the population, however, we cannot make causal inferences, as this was not a randomized experiment. That means, we can say that in general, people with X years of education make Y many times as people with Z years of education, but we cannot say it is due to the education itself. 144 Chapter 24 unit 5 lecture slides More slides 145 10/13/2018 ANOVA 1. Make a Scatterplot of the data in the table below. “Level” is the Explanatory Variable (X=1, 2, or 3). UNIT 5: Chapter 5 Level i=1 Level i=2 Level i=3 Y1|X=i 3 10 20 Y2|X=i 5 12 22 Y3|X=i 7 14 24 ANOVA 2. Find the Grand Mean … this is the mean of all the Ys together … regardless of Level. Pure ANOVA ANOVA 4. Now we need to find the Sum of the Squared Residuals for the Equal Means Model. 1. Make a Scatterplot of the data in the table below. “Level” is the Explanatory Variable (X=1, 2, or 3). Level i=1 Level i=2 Level i=3 Y1|X=i 3 10 20 Y2|X=i 5 12 22 Y3|X=i 7 14 24 5 12 22 2. Find the Grand Mean … this is the mean of the sample means. If the sample size is the same in each group, then this is the mean of all the Ys together … regardless of Level. Level i=1 Level i=2 Level i=3 Y1|X=i 3 10 20 Y2|X=i 5 12 22 Y3|X=i 7 14 24 5 12 22 Level i=1 Level i=2 Level i=3 Level i=1 Level i=2 Level i=3 6. Compare the Total Sum of Squares for each model. Which do you think “fits” better? Pure ANOVA 4. Now we need to find the Sum of the Squared Residuals for the Equal Means Model. Level i=1 Level i=2 Level i=3 Y1|X=i 3 10 20 Y2|X=i 5 12 22 Y3|X=i 7 14 24 5 12 22 Pure ANOVA 4. Now we need to find the Sum of the Squared Residuals for the Equal Means Model. Level i=1 Level i=2 Level i=3 Y1|X=i 3 10 20 Y2|X=i 5 12 22 Y3|X=i 7 14 24 5 12 22 Level i=1 Level i=2 Level i=3 Level i=1 Level i=2 Level i=3 (3-13)2 = 100 (10-13)2 = 9 49 (3-13)2 = 100 9 49 (5-13)2 = 64 1 81 64 1 81 36 1 121 36 1 121 Level i=1 Level i=2 Level i=3 6. Compare the Total Sum of Squares for each model. Which do you think “fits” better? 
Level i=1 Level i=2 Level i=3 (3-5)2 = 4 (10-12)2 = 4 (20-22)2 = 4 0 0 0 4 4 4 6. Compare the Total Sum of Squares for each model. Which do you think “fits” better? 1 10/13/2018 Pure ANOVA Sum of Squares in ANOVA Between group variation (top row) Total variation (bottom row) Level i=1 Level i=2 Level i=3 Y1|X=i 3 10 20 Y2|X=i 5 12 22 Y3|X=i 7 14 24 7. Now we would like to make an ANOVA table to test the alternative hypothesis! Formally write the Ho and Ha and fill in the table. df Within group variation (middle row) SS MS F Pr > F Model / Extra SS *To compute the sum of squares column for the ANOVA table, square each distance (lines in black) and then add. The sum of squared* distances (black lines) for left two graphs = the sum of squared distances (black lines) for the right graph. Error / Residual/Full Model Total (Reduced) Extra Sum of Squares = Residual Sum of Squares Reduced – Residual Sum of Squares Full *Each distance squared for the top left graph is multiplied by the number in each group. Pure ANOVA Pure ANOVA 7. Now we would like to make an ANOVA table to test the alternative hypothesis! Formally write the Ho and Ha and fill in the table. Ho : µ 1 = µ 2 = µ 3 Ha: At least 1 pair are different Formally write the Ho and Ha and fill in the table. (Equal Means Model µ µ µ) (Separate Means Model µ1 µ2 µ3) df SS MS F Ho : µ 1 = µ 2 = µ 3 Ha: At least 1 pair are different Pr > F Model / Extra SS Error / Residual/Full Model 6 24 Total (Reduced) 8 462 7. Now we would like to make an ANOVA table to test the alternative hypothesis! 4 Extra Sum of Squares = Residual Sum of Squares Reduced – Residual Sum of Squares Full df SS Model / Extra SS 8-6=2 462-24=438 Error / Residual/Full Model 6 24 Total (Reduced) 8 462 MS F Pr > F 4 Extra Sum of Squares = Residual Sum of Squares Reduced – Residual Sum of Squares Full Pure ANOVA Pure ANOVA 7. Now we would like to make an ANOVA table to test the alternative hypothesis! Formally write the Ho and Ha and fill in the table. Ho : µ 1 = µ 2 = µ 3 Ha: At least 1 pair are different (Equal Means Model µ µ µ) (Separate Means Model µ1 µ2 µ3) 7. Now we would like to make an ANOVA table to test the alternative hypothesis! Formally write the Ho and Ha and fill in the table. (Equal Means Model µ µ µ) (Separate Means Model µ1 µ2 µ3) F Ho : µ 1 = µ 2 = µ 3 Ha: At least 1 pair are different Pr > F (Equal Means Model µ µ µ) (Separate Means Model µ1 µ2 µ3) df SS MS df SS MS F Model / Extra SS 2 438 438/2=219 Model / Extra SS 2 438 219 219/4=54.75 Error / Residual/Full Model 6 24 4 Error / Residual/Full Model 6 24 4 Total (Reduced) 8 462 Total (Reduced) 8 462 Extra Sum of Squares = Residual Sum of Squares Reduced – Residual Sum of Squares Full Pr > F Extra Sum of Squares = Residual Sum of Squares Reduced – Residual Sum of Squares Full 2 10/13/2018 F -Test of Different Means … Pure ANOVA Ho: µ1= µ2 = µ3 Ha: At least 1 pair are different 7. Now we would like to make an ANOVA table to test the alternative hypothesis! (Equal Means Model) (Separate Means Model) Formally write the Ho and Ha and fill in the table. Ho : µ 1 = µ 2 = µ 3 (Equal Means Model µ µ µ) Ha: At least 1 pair are different (Separate Means Model µ1 µ2 µ3) df SS MS F Pr > F Model / Extra SS 2 438 219 54.75 .0001 Error / Residual/Full Model 6 24 4 Total (Reduced) 8 462 Extra Sum of Squares = Residual Sum of Squares Reduced – Residual Sum of Squares Full 6 Steps for ANOVA F Test (diff means)! 1. Ho: µ1= µ2 = µ3 Ha: At least 1 pair are different 2. Critical value: You can skip this step for ANOVA. 3. 
F statistic = 54.75 4. P-value = .0001 5. Reject Ho. 6. The evidence suggests that at least 1 pair of the group means are different. (P-value < .0001 from an ANOVA.) (Equal Means Model) (Separate Means Model) R-Squared! R2 F-Distribution Coefficient of Variation R =correlation coefficient = coefficient of determination 3 10/13/2018 ANOVA: Assumptions and Robustness 1. Normality: Similar to t-tools hypothesis testing, ANOVA is robust to this assumption. Extremely longtailed distributions (outliers) or skewed distributions, coupled with different sample sizes (especially when the sample sizes are small) present the only serious distributional problems. 2. Equal Standard Deviations: This assumption is crucial, paramount, and VERY important. 3. The assumptions of independence within and across groups are critical. If lacking, different analysis should be attempted. More on Constant SD Samples drawn from Normal Distributions • Same visual checks as with t-tools, just for more groups. – Histograms – Q-Q plots Levene’s Test (Median) Ho: σ1= σ2 Ha: σ1≠ σ2 95% confidence interval accuracy with different sample sizes and standard deviations for three groups. But … proc ttest does not have Levene’s Test!!! Proc GLM Has Levene’s Test Check of Assumptions: Constant SD There is some visual evidence against equal standard deviations. The BrownForsythe test was used as secondary evidence and does not provide significant evidence against equal standard deviations. (p-value = .2558) 4 10/13/2018 Archeology in New Mexico An archeological dig in New Mexico yielded four sites with lots of artifacts. The depth (cm) that each artifact was found was recorded along with which site it was found in. The researcher has reason to believe that sites 1 and 4 and sites 2 and 3 may be similar in age. In theory, the deeper the find, the older the village. Is there any evidence that sites 1 and 4 have a mean depth that is different than the mean depth of artifacts from sites 2 and 3? Archeology Example Assumptions: Normality Archaeology Example Depth 93 120 65 105 115 82 99 87 100 90 78 95 93 88 110 Site 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Depth 85 45 80 28 75 70 65 55 50 40 Site 2 2 2 2 2 2 2 2 2 2 Depth 100 75 65 40 73 65 50 30 45 50 45 55 Site 3 3 3 3 3 3 3 3 3 3 3 3 Depth 96 58 95 90 65 80 85 95 82 Site 4 4 4 4 4 4 4 4 4 Archeology Example Assumptions: Homogeneity (Equal SD) Histograms will be helpful as well! Archeology Example Assumption: Independence The discovered artifacts associated with the depths were randomly selected from the log (book of recordings … not logarithms!) of discoveries. Since the artifacts and, thus, the depths are associated with completely different sites, it is assumed that the data are independent between sites. Question of Interest: 1. Are any of the means different? 2. Are the means of sites 1 and 4 different? 3. Are the means of sites 2 and 3 different? 4. Satisfactory results of questions 1 and 2 will allow us to ask the third question: are sites 1 and 4 different than 2 and 3? 5 10/13/2018 Are sites 1 and 4 different from 2 and 3? *Assumes ANOVA assumptions are met Stop: Insufficient evidence that any means are different Perform regular ANOVA to test if any of the means are different from the rest. Reduced Model Ho: µ µ µ µ Full Model Ha: µ1 µ2 µ3 µ4 BYO ANOVA to test if the means of 2 and 3 are different, given at least one pair is different. Reduced Model Ho: µ1 µ0 µ0 µ4 Full Model Ha: µ1 µ2 µ3 µ4 Reject Ho in favor of Ha: µ1 µ2 µ3 µ4? Reject Ho in favor of Ha: µ1 µ2 µ3 µ4? 
no BYO ANOVA to test if the means of 1 and 4 are different, given at least one pair is different. Reduced Model Ho: µ0 µ2 µ3 µ0 Full Model Ha: µ1 µ2 µ3 µ4 yes yes no yes Stop: Groups 1 and 4 are different and should not be treated as having the same means, as the QoI suggests. Stop: Groups 2 and 3 are different and should not be treated as having the same means, as the QoI suggests. Stop: Evidence does NOT support the claim in QoI Reject Ho in favor of Ha: µa µb µb µa ? no Question of Interest: 2. Are the means of sites 1 and 4 different? Compare this model against equal means model (µ µ µ µ) *Recode the variables into three groups: 2, o 3, and 1/4 combined and a perform ANOVA to get the first (Ho) Reduced: µ µ µ µ table. (H ) Full*: µ µ µ µ (H ) Reduced Model: µo µ2 µ3 µo (H ) Full Model: µ1 µ2 µ3 µ4 2 3 Source DF SS MS F Pr>F Model (Full) 1 780.3 2.86 .098 Error (From Full) 42 11464.6 Total (From Reduced*) 43 12244.9 780.3 Compare this model against equal means model (µ µ µ µ) (Ho) Reduced: µ µ µ µ (Ha) Full: µ1 µ2 µ3 µ4 o 273.0 There is not enough evidence to suggest (alpha = .05, p-value = .098) that site 1 and site 4 have different mean depths. Question of Interest: (try it!) 3. Are the means of sites 2 and 3 different? *Recode the variables into three groups: 1, 4, and 2/3 combined and perform ANOVA to get the first table. (Ho) Reduced Model: µ1 µo µo µ4 (Ha)Full Model: µ 1 µ 2 µ3 µ4 (Ho) Reduced: µ µ µ µ (Ha) Full*: µ1 µo µo µ4 Source DF SS (Ho) Reduced: µ µ µ µ (Ha) Full: µ1 µ2 µ3 µ4 MS Model (Full) Error (From Full) 42 11464.6 Total (From Reduced) 43 11477.7 273 F Pr>F There is evidence to suggest that at the alpha = .05 level of significance (pvalue < .0001) that at least 2 of the sites have different mean depths. Stop: Evidence does support the claim in QoI yes no o (Ho) Reduced Model: µ µ µ µ (Ha) Full Model: µ1 µ2 µ3 µ4 Perform ANOVA to test if the means of 1 and 4, when taken together are different than means 2 and 3, when also taken together. Reduced Model Ho: µ µ µ µ Full Model Ha: µa µb µb µa Reject Ho in favor of Ha: µ1 µ2 µ3 µ4? a First Ask: Is there reason to believe any of them are different? The reduced and full models are associated with Ho and Ha, respectively, although they are not exactly equal to the hypotheses. Question of Interest: (try it!) 3. Are the means of sites 2 and 3 different? *Recode the variables into three groups: 1, 1 o o 4 o 4, and 2/3 combined and 1 2 3 4 a perform ANOVA (Ho) Reduced: µ µ µ µ (Ho) Reduced: µ µ µ µ to get the first table. (Ha) Full: µ1 µ2 µ3 µ4 (Ha) Full*: µ1 µo µo µ4 (H ) Reduced Model: µ µ µ µ (H ) Full Model: µ µ µ µ Source DF SS MS F Pr>F Model (Full) Error (From Full) Total (From Reduced*) Question of Interest: (try it!) 3. Are the means of sites 2 and 3 different? *Recode the variables into three groups: 1, 4, and 2/3 combined and perform ANOVA to get the first table. (Ho) Reduced Model: µ1 µo µo µ4 (Ha) Full Model: µ1 µ2 µ3 µ4 (Ho) Reduced: µ µ µ µ (Ha) Full*: µ1 µo µo µ4 (Ho) Reduced: µ µ µ µ (Ha) Full: µ1 µ2 µ3 µ4 Source DF SS MS F Pr>F Model (Full) 1 13.1 13.1 .048 .828 Error (From Full) 42 11464.6 273 Total (From Reduced) 43 11477.7 There is not enough evidence to suggest (alpha = .05, p-value = .828) that site 2 and site 3 have different mean depths. 6 10/13/2018 Question of Interest: 4. Are sites 1 and 4 different than 2 and 3? *Recode the variables into two groups 1/4 and 2/3 and perform ANOVA to get the table. 
A Small Example (Ho) Reduced: µ µ µ µ (Ha) Full: µb µa µa µb There is sufficient evidence to suggest (alpha = .05, p-value < .0001) that sites 1 and 4 have different mean depths than sites 2 and 3. Normality Assumption Homogeneity of Variance Assumption There is some (weak) evidence in support of these data coming from distributions with different standard deviations. If the standard deviation assumption and normality assumption are both violated, what should we do? There is strong evidence against these data coming from a normal distribution and the sample size is small. ANOVA? WELCH’S ANOVA? So …. NONPARAMETRIC!!!! Kruskal-Wallis Test There is not sufficient evidence at the alpha = .05 level of significance (p-value = .3766 from Kruskal-Wallis Test) to suggest that at least two of the medians are different. Notice that each test failed to reject their respective Ho. The point isn’t so much that one test will reject when the other will fail to reject. We must remember that as statisticians, we don’t personally favor one outcome over the other. We just want the appropriate test: the one with the most power. Kruskal-Wallis Test is the appropriate test here. 7 10/13/2018 Another Analysis!!!! Normality Assumption … There is strong evidence in favor of these data coming from a normal distribution. We will proceed under this assumption. Assumptions and Analysis: There is strong evidence in support of these data coming from distributions with different standard deviations. We will proceed under this assumption and run the Welch’s ANOVA. Regular ANOVA: There is sufficient evidence at the alpha = .05 level of significance (p-value = .0201 from Welch’s ANOVA) to suggest that at least two of the means are different. However, remember caveat to any different SD’s approach. Fixed or random effects Fixed Effects vs. Random Effects Quick answer: • Do your groupings exhaust the data (e.g., data on four different machines and there are only four machines)? Fixed Effects! Use Proc GLM in SAS. • Are your groupings a random sample of a larger population that could have been chosen to be a group (e.g., data on four different machines that were chosen from a random sample of 100 machines)? Random Effects! Use Proc Mixed in SAS. APPENDIX Measured the amount of liquid in twenty randomly selected cans of Coke and twenty randomly selected cans of Diet Coke at a regional bottling company. Coke and Diet Coke are bottled using different types of machines. Scenario 1: There is only one machine of each type. Fixed Effects Scenario 2: There are several of each type of machine. The Coke samples all came from the same Coke bottling machine, and the Diet Coke samples all came from the same Diet Coke machine. Random effects 8 10/13/2018 MSE vs. Variance in each group Examples Another example! 5 different sports were analyzed to see if the average height of basketball players was greater than the average of all the other sports. We could, of course, compare each pairwise grouping of sports, but that would result in 4 tests. This would take a lot of time, and those tests would each have less power since they don’t use all the data. Let’s use ANOVA similarly to how we did in prior problems. 1. Make a side by side box plot of the data. 2. Run a basic ANOVA to test for any pairwise difference of means. Check the assumptions here, but no need to address them after this. 3. Test the model that keeps basketball by itself but groups the other sports as “others.” 4. 
Use the previous two models to conduct an extra sum of squares FTest: Ho: Reduced Model: µB µO µO µO µO Ha: Full Model: µB µF µSoc µSwim µT 5. Depending on the results of this test, test to see if there is evidence that basketball has a different mean than each of the sports. (Equivalent to testing basketball versus the others.) Ho: Reduced Model: µO µO µO µO Ha: Full Model: µB µO µO µO µO µO 6. Make sure and provide written conclusions for questions 2,3,4 and 5. 9 10/13/2018 First … Plot the Data! Plot the Data cont. Normality: We have very small sample sizes here. There is not a lot of evidence against normality for each group, although there is not a lot of evidence to begin with. We will proceed with caution under the assumption of normal distributions for each sport. Homogeneity of Variance: Judging from the box plots, there is some visual evidence against equal standard deviations, although the sample size is still small. A secondary test would be nice to lean on here. We will assume the observations are independent both between and within groups. Brown and Forsythe Test for Equality of Variance. 1 Way ANOVA Ho: µBasketball = µFootball= µSoccer = µSwim = µTennis Ha: At least one pair of means is different. There is some visual evidence against equal standard deviations between sports. The Brown and Forsythe test was used as secondary evidence and does not provide significant evidence against equal standard deviations. (pvalue = .9672) There is strong evidence to suggest that the at least one of the sports has a mean height that is different than the others (p-value < .0001 from an ANOVA). Ho: µBasketball = µFootball= µSoccer = µSwim = µTennis Ha: At least one pair of means are different. F-TEST Ho: The Others are equal. (Including Basketball) Same Test as last slide …. F-TEST Different Notation Ho: Reduced Model: µB µO µO µO µO Ho: Reduced Model: µ µ µ µ µ Ha: Full Model: µB µF µSoc µSwim µT Ha: The Others are different (Including Basketball) Ho: µBasketball = µFootball= µSoccer = µSwim = µTennis Ha: µBasketball is different than the Others. Fail to Reject Ho There is not sufficient evidence at the alpha = .05 level of significance (p-value = 0.5375) to suggest that the mean heights of non-basketball sports are not equal. Therefore we will proceed as if they are equal. Ha: Full Model: µB µF µSoc µSwim µT Ho: Reduced Model: µ µ µ µ µ Ha: Full Model: µB µO µO µO µO Fail to Reject Ho There is not sufficient evidence at the alpha = .05 level of significance (p-value = 0.5375) to suggest that the mean heights of non-basketball sports are not equal. Therefore we will proceed as if they are equal. 10 10/13/2018 µB µO µO µO µO µB µF µSoc µSwim µT Ho: µBasketball = µOthers Ha: µBasketball ≠ µOthers F-TEST: Another Look Ho: Reduced Model: µB µO µO µO µO Ha: Full Model: µB µF µSoc µSwim µT Source Since we are proceeding under the assumption that the mean heights of the other sports (besides basketball) are equal, we can test whether basketball has a mean height different than the other sports by testing: DF SS MS F Pr > F Model 3 11.63 3.87 .74 0.5375 Error 27 141.56 5.24 Corrected Total 30 153.19 There is strong evidence at the alpha = .05 level of significance (pvalue < .0001) that supports the claim that the mean height of basketball players is different than that of the other 4 sports. Resources www.itl.nist.gov/div898/handbook/prc/section4/prc433.htm Spock Example Spock Trial The Raw Data • 1968: Dr. 
Ben Spock was accused of conspiracy to violate the Selective Service Act by encouraging young men to resist being drafted into military service for Vietnam. • Jury Selection: A “venire” of 30 potential jurors is selected at random from a list of 300 names that were previously selected at random from citizens of Boston. • A jury is then selected NOT at random by the attorneys trying the case. • For this case, the venire consisted of only one woman, who was let go by the prosecution, thus resulting in an all male jury. • There was reason to believe that women were more sympathetic to Dr. Spock’s actions due to his popular child rearing books. • The defense argued that the judge in this case had a history of venires that underrepresented women, which is contrary to the law. • Let’s see if there is any evidence for this claim! 11 10/13/2018 Comparing Two Means From Many Groups. Judge N Xbar Sd Spock 9 14.6 5.04 A 5 34.1 11.94 B 6 33.6 6.58 C 9 29.1 4.59 D 2 27.0 3.81 E 6 27.0 9.01 F 9 26.8 5.97 Ho: µS = µF Ha: µS ≠ µF Spock Data Steps With 2 groups estimating the pooled SD. Question: Suppose we wish to test if the “S” judge’s venires are different from the “F” judge’s. With all 7 groups estimating the pooled SD, bigger ‘n’ greater df! More POWER!!! sp = 6.91 P-value = .0006 Two Judge Analysis w/ t-Tools Two Judge Analysis w/ Several-Groups From PROC TTEST: Estimated Diff = Sp = Pooled Std. Error = t-Statistic = Deg. of freedom = Statistical Conclusion: We find that there is substantial evidence that the difference in the mean percentage of females on judge S and judge F venires is not equal to zero. -12.1778 5.5234 2.6038 -4.68 16 Deg. of freedom = 46 – 7 = 39 Estimated Diff = -12.1778 Sp = 5.5234 Pooled Std. Error = 2.6038 t-Statistic = -4.68 Deg. of freedom = 16 Two Judge Analysis: Conclusion Question: Suppose we wish to test if the “S” judge’s venires are different from the “F” judge’s. Spock Trial QOI 2 The defense argued that the judge in this case had a history of venires that underrepresented women, which is contrary to the law. Answer: There is evidence that the mean of the two groups is different. QOI2: Is the percent of women on recent venires of Spock’s judge (which we will call S) significantly lower than those of 6 other judges (which we notate A to F)? • There are two key questions: • 1. Is there evidence that women are underrepresented on S’s venires relative to A to F’s? 2. Is there evidence of a difference in women’s representation on A to F’s venires? •The question of interest is addressed by 1 •The strength of the result in 1 would be substantially diminished if 2 is true 12 10/13/2018 Step 1: Compare Judges A - F Spock: The Strategy Ho: All “other” means are equal (A, B, C, D, E, F) Ha: At least 2 “other” means are different (A, B, C, D, E, F) But … Let’s use all the data to estimate the pooled standard deviation! Reduced Model: µs µo µo µo µo µo µo Full Model: µs µA µB µC µD µE µF Different Models in SAS At Least 2 are different (S, A, B, … F) Different Models in SAS µs µA µB µC µD µE µF At Least 2 are different (S, A, B, … F) µs µA µB µC µD µE µF Spock is different than the Others µs µo µo µo µo µo µo Spock is different than the Others µs µo µo µo µo µo µo Comparing Two Models: Both are not Equal Means Model At least 2 are different (Spock, A, B, C … F) µs µA µB µC µD µE µF SAS (proc glm) compares models to the equal means model. When you run proc glm, it always makes the “Corrected Total Row” the equal means model. 
However, we can build our own ANOVA table (BYOA) to compare two models, both of which are not the equal means model. To do this we will need to identify the “full” model and the “reduced” model. The “full” model will be the model with the most parameters (means) in it while the “reduced model” will have fewer parameters. (Note that the equal means model (with one parameter) is the most reduced model you can have.) Spock is different than others µs µo µo µo µo µo µo F-TEST: Another Look Ho: µA, µB, µC …. µF are Equal Ha: At least 2 are different (A,B,C …F) Extra Sum of Squares Test / BYOA Reduced : µs µo µo µo µo µo µo Source DF SS MS F Full: µs µA µB µC µD µE µF Pr > F Model Separate (Full Model) Error Means Model Corrected Total Equal Means Model (Reduced Model) Full Reduced Source DF SS MS F Pr > F Model 5 326.5 65.29 1.37 0.26 Error 39 1864.4 47.81 Corrected Total 44 2190.9 13 10/13/2018 EXTRA SUMS OF SQUARES F TEST Step 1 Complete! Ho: All means are equal (Spock,A,B,C…,F) F-TEST Ha: At least 2 are different (Spock,A,B,….F) Ho: µA – µF are Equal Ha: At least 2 are different (A,B, .. F) There is not sufficient evidence to suggest that the mean percent of women on judge’s A-F venires are different from one another (p-value = .26 from an ANOVA). Therefore, we will now move on to Step 2 and compare Spock’s judge’s mean to the single mean that will represent the other judges. F-TEST: Another Look Fail to Reject Ho Ho: Spock is equal to Others Ha: Spock is diff from Others There is not sufficient evidence at the alpha = .05 level of significance (p-value = 0.26) to suggest that the means are not equal. Therefore, we will proceed as if they are equal. Ho: µA, µB, µC …. µF are Equal Ha: At least 2 are different (A,B,C …F) Source DF SS MS F Pr > F Model 5 326.5 65.29 1.37 0.26 Error 39 1864.4 47.81 Corrected Total 44 2190.9 Step 2! Since we are proceeding under the assumption that the mean percentage of women in venires of the non-Spock judges are equal, we can test whether the Spock judge has a mean percentage different than the other judges by testing: Ho: Mean of Spock is equal to the mean of the others. Ha: Mean of Spock is different than the mean others. There is strong evidence at the alpha = .05 level of significance (p-value < .0001 from an ANOVA) to support the claim that the mean percentage of women in the Spock judge’s venires is less than that of the other 6 judges and that there is no evidence that the other 6 judges have different mean percentages of women on their venires (pvalue = .26 from an Extra Sum of Squares F Test). Spock’s lawyer has evidence for a mistrial. 14 Part VI Multiple comparisons and post hoc tests 160 Chapter 25 Problem 1: Bonferroni and the Handicap Study The Bonferroni method was used to construct some simultaneous confidence intervals for µ1 − µ2 , µ2 − µ5 and µ3 − µ5 , to see whether there are differences in attitude toward the mobility type of handicaps. The Bonferroni CIs were calculated using the following SAS code: Note that lsmeans and means have the same Code 25.1. Bonferroni in SAS proc glm data = handicap; class handicap; model score = handicap; means handicap / hovtest = bf bon cldiff; lsmeans handicap / pdiff adjust = bon cl; run; results, because we are dealing with balanced data The result of this code is shown below: Figure 25.0.1. Bonferroni Confidence Intervals Another nice way to visualize these confidence intervals is like this: 161 Analysis Guide Midterm Figure 25.0.2. 
Diffogram of the Bonferroni Confidence Intervals As we see from these two figures, the only statistically significant mean difference was the crutches vs the hearing, which means that the attitude towards the different mobility handicaps is the same (µ1 − µ2 , µ2 − µ5 and µ3 − µ5 are not different) 162 Chapter 26 Multiple Comparison and the Handicap Study To generate all the multiple comparisons, and the half widths, the follwoing SAS code was used: Here we Code 26.1. all the multiple comparisons in SAS proc glm data = handicap; class handicap; model score = handicap; means handicap / tukey bon scheffe LSD Dunnett(’None’); run; see the results of this (a) Bonferroni (b) Tukey (c) Dunnet (d) Scheffe (e) LSD Figure 26.0.1. Half widths of different post hoc analyses in SAS 163 Analysis Guide Midterm We did the same thing in R, with code and output shown below: 164 Analysis Guide Midterm Code 26.2. Multiple comparisons with R 1 2 3 4 5 6 7 8 prob2 <- case0601 # we make none the first group so that dunnetts test behaves prob2$Handicap <-factor(prob2$Handicap ,levels=c(’None ’, ’Amputee ’, ’Crutches ’, ’Hearing ’, ’Wheelchair ’)) aovmodel <- aov(Score ~ Handicap , data=Handi) # Now we can begin our tests # Tukey ’s test tukey <- glht(aovmodel ,linfct=mcp(Handicap="Tukey")) confint(tukey) #Tukey 9 10 11 12 Simultaneous Confidence Intervals 13 14 Multiple Comparisons of Means: Tukey Contrasts 15 16 17 Fit: aov(formula = Score ~ Handicap , data = Handi) 18 19 20 Quantile = 2.8066 95% family -wise confidence level 21 22 23 24 25 26 27 28 29 30 31 32 33 34 Linear Hypotheses: Estimate lwr upr Amputee - None == 0 Crutches - None == 0 Hearing - None == 0 Wheelchair - None == 0 Crutches - Amputee == 0 Hearing - Amputee == 0 Wheelchair - Amputee == 0 Hearing - Crutches == 0 Wheelchair - Crutches == 0 Wheelchair - Hearing == 0 -0.4714 1.0214 -0.8500 0.4429 1.4929 -0.3786 0.9143 -1.8714 -0.5786 1.2929 -2.2037 1.2608 -0.7108 2.7537 -2.5822 0.8822 -1.2894 2.1751 -0.2394 3.2251 -2.1108 1.3537 -0.8179 2.6465 -3.6037 -0.1392 -2.3108 1.1537 -0.4394 3.0251 35 36 37 # Calculated by hand half width = 1.73225 38 39 40 41 # bonferroni ## confint(tukey ,test=adjusted(type="bonferroni")) # bonferroni , we can just apply the bonferroni to whatever # according to the documentation 42 43 Simultaneous Confidence Intervals 44 45 Multiple Comparisons of Means: Tukey Contrasts 46 47 48 Fit: aov(formula = Score ~ Handicap , data = Handi) 49 50 51 Quantile = 2.8057 95% family -wise confidence level 52 53 54 55 56 57 58 59 60 61 62 63 64 65 Linear Hypotheses: Estimate lwr upr Amputee - None == 0 Crutches - None == 0 Hearing - None == 0 Wheelchair - None == 0 Crutches - Amputee == 0 Hearing - Amputee == 0 Wheelchair - Amputee == 0 Hearing - Crutches == 0 Wheelchair - Crutches == 0 Wheelchair - Hearing == 0 -0.4714 1.0214 -0.8500 0.4429 1.4929 -0.3786 0.9143 -1.8714 -0.5786 1.2929 -2.2031 1.2602 -0.7102 2.7531 -2.5817 0.8817 -1.2888 2.1745 -0.2388 3.2245 -2.1102 1.3531 -0.8174 2.6459 -3.6031 -0.1398 -2.3102 1.1531 -0.4388 3.0245 66 67 68 # Calculated by hand half width = 1.73165 69 70 71 72 ## LSD # LSD <- LSD.test(aov(lm(Score ~ Handicap , data=ppp)), "Handicap") # LSD LSD$statistics$LSD # LSD Half int 73 74 75 [1] 1.232618 76 77 78 79 80 81 # Dunnett 165 dunnett <- glht(aovmodel ,linfct=mcp(Handicap="Dunnett")) confint(dunnett) #Dunnett Chapter 27 Comparing groups: Education study 27.1 Assumptions Raw Data Analysis First, we will look at the raw data. 
To check if the raw data fits the assumptions, we will first look at a scatter plot. The scatter plot of the raw data was produced by the following bit of SAS code: proc sgplot data=EduData; scatter x=educ y=Income2005; run; This results in the following plot: Figure 27.1.1. Scatter Plot of the Raw Data Looking at Figure 27.1.1, we see that the raw data is very heavy in between 0 and 20,000 for all categories, but some groups spread further and wider than others, which suggests the variances may not be equal. The heaviness of the lower end of each group may also suggest a lack of normality. We will examine this further with some Box plots. These were produced using the following chunk of SAS code: proc sgplot data=EduData; vbox Income2005 / category=educ dataskin=matte ; xaxis display=(noline noticks); yaxis display=(noline noticks) grid; run; This results in the following plot: 166 Analysis Guide Midterm Figure 27.1.2. Box Plot of the Raw Data Figure 27.1.2 tells us a lot about our data. We see from the size and shape of the boxes that the variances of our data are by no means homogeneous. Note that there are a lot of outliers while the distribution is heavily weighted towards the bottom, this suggests our data may have departed from normality. We will examine this phenomenaa further using histograms. To produce histograms of the raw data, the following SAS code was used: proc sgpanel data=EduData; panelby educ / rows=5 layout=rowlattice; histogram Income2005; run; This results in the following plot: Figure 27.1.3. Histogram of the Raw Data Figure 27.1.3 confirms our suspicions, the variances of the data are likely unequal, but more importantly, the data is clearly skewed to the right. We will confirm this using Q-Q plots. To produce Q-Q plots of the raw data, the following SAS code was used: /* Normal = blom produces normal quantiles from the data */ /* To find out more, look at the SAS documentation!*/ 167 Analysis Guide Midterm proc rank data=EduData normal=blom out=EduQuant; var Income2005; /* Here we produce the normal quantiles!*/ ranks Edu_Quant; run; proc sgpanel data=EduQuant; panelby educ; scatter x=Edu_Quant y=Income2005 ; colaxis label="Normal Quantiles"; run; This results in the following plot: Figure 27.1.4. Q-Q Plot of the Raw Data The Q-Q plots in Figure 27.1.4 tell us what we already know: The raw data is not normal, and does not have equal variances. The ANOVA test is not super robust to highly skewed, long tailed data, and it relies entirely on equal variances, so we absolutely cannot use the raw data Transformed Data Analysis Now we will perform a log transformation on the data and see if that helps it meet our assumptions better. To do a log transformation, we will employ the following SAS code: data LogEduData; set EduData; LogIncome=log(Income2005); run; We will begin our analysis of the transformed data with a scatter plot, produced with the following SAS code: proc sgplot data=LogEduData; scatter x=educ y=LogIncome; run; This results in the following plot: 168 Analysis Guide Midterm Figure 27.1.5. Scatter Plot of the Log-Transformed Data As we can see in Figure 27.1.5, the groups have a much more similar size, suggesting similar variances, and the heavy part of the scatter plot is closer to the center, in between the outliers, which tells us the log transformation may have done a good deal towards normalizing our data. We can examine this further using Box plots. 
To produce Box plots of the transformed data, the following SAS code was used: proc sgplot data=LogEduData; vbox LogIncome / category=educ dataskin=matte ; xaxis display=(noline noticks); yaxis display=(noline noticks ) grid; run; This gives us the following plot: Figure 27.1.6. Box Plot of the Log-Transformed Data Figure 27.1.6 gives us some useful information about our data. We see the boxes and whiskers are of similar size, which tells us the variances are likely homogeneous. Furthermore, the medians and means are near each other, and the boxes are near the center of the distribution, which suggests that the data may be normal. We will examine these two phenomena further with histograms. To produce histograms of the log-transformed data, the following SAS code was used: proc sgpanel data=LogEduData; 169 Analysis Guide Midterm panelby educ / rows=5 layout=rowlattice; histogram LogIncome; run; This results in the following plot: Figure 27.1.7. Histogram of the Log-Transformed Data From the spread of the histograms in Figure 27.1.7, we see two things. First, the similar width of the histograms confirms that variances are roughly equal. Second, the shape of the histograms, and their location near the center suggests that the data is very nearly normal. We will further examine the normality of the data using Q-Q plots. To produce the Q-Q plots of the transformed data, the following SAS code was used: proc rank data=LogEduData normal=blom out= LogEduQuant; var LogIncome; ranks LogEduQuant; run; proc sgpanel data=LogEduQuant; panelby educ; scatter x=LogEduQuant y=LogIncome ; colaxis label="Normal Quantiles"; run; This results in the following plot: 170 Analysis Guide Midterm Figure 27.1.8. Q-Q Plot of the Log-Transformed Data Examining the previous figure, we see a confirmation of our beliefs: The log-transformed data, when plotted against normal quantiles, is fairly normal. This means, with the log transformed data, we can reasonably assume normality and homogeneity of variances. We have fulfilled the assumptions of the ANOVA test and now we are ready to go! 171 Chapter 28 selection and execution First, we run an f test to see if any of the means are different! 28.1 ANOVA We will now perform a complete analysis of our data, using Pure ANOVA. Problem Statement We would like to determine whether or not at least one of the five population distributions (corresponding to different years of education) is different from the rest. Assumptions As seen in Section ??, the raw data does not meet the assumption of normality nor of homogeneity of variance. However, in Section 27.1, we proved that after a log transformation, the data does meet both of these assumptions. The ANOVA test is fairly robust to the slight departure from normality presented by the log transformed data, and the variances are equal. The data is clearly independent, so that assumption is met. Therefore, all assumptions of ANOVA are met by the log transformed data. Hypothesis Definition In this problem, our Null (Reduced Model) Hypothesis, H0 , is that all the groups have the same distribution and our Alternative (Full Model) Hypothesis, H1 is that the distributions are different. 
Mathematically, that is written as: H0 :mediangrand H1 :median<12 mediangrand median12 mediangrand median13−15 mediangrand median16 mediangrand median>16 (28.1.1) (28.1.2) We will consider our confidence level, α to be 0.05 F Statistic To conduct this hypothesis test, the following SAS code was used: proc glm data = LogEduData; class educ; model LogIncome = educ; run; This results in the following ANOVA Output: Figure 28.1.1. ANOVA Table Figure 28.1.1 tells us what our F statistic is. We see that F = 62.87 172 (28.1.3) Analysis Guide Midterm P-value Figure 28.1.1 also tells us our p-value. In this case, (28.1.4) p < .0001 Hypothesis Assessment In this scenario, we have that p < .0001 < α = .05 and therefore we reject the null hypothesis. Conclusion There is substantial evidence (p < 0.0001) that at least one of the distributions is different from the others. 28.2 Tukey’s test We want to compare all of the group means to see if they are different, so we do tukey’s test! we do this with the following SAS code: With this we see that aside from the college and graduate school educations, Code 28.1. Tukeys test in SAS and R proc glm data = LogEduData; class educ; model LogIncome = educ; lsmeans LogIncome / pdiff = ALL adjust=tukey cl; run; and the following R code (and output) 1 2 3 4 5 6 edudata <- read.csv(file=’c:/Users/david/Desktop/MSDS/MSDS6371/Homework/Week6/ Data/ex0525.csv’, header=TRUE , sep = ",") edudata$logincome <- log(edudata$Income2005) prob3 <- edudata aovmodel2 <- aov(logincome~Educ ,data =prob3) tukkey <- glht(aovmodel2 ,linfct=mcp(Educ="Tukey")) summary(tukkey) 7 8 Simultaneous Tests for General Linear Hypotheses 9 10 Multiple Comparisons of Means: Tukey Contrasts 11 12 13 Fit: aov(formula = logincome ~ Educ , data = prob3) 14 15 16 17 18 19 20 21 22 23 24 25 26 27 Linear Hypotheses: Estimate Std. Error t value Pr(>|t|) <12 - <<12 == 0 -0.32787 0.08493 >16 - <<12 == 0 0.67069 0.05624 13 -15 - <<12 == 0 0.16400 0.04674 16 - <<12 == 0 0.56987 0.05459 >16 - <12 == 0 0.99856 0.09316 13 -15 - <12 == 0 0.49187 0.08775 16 - <12 == 0 0.89775 0.09217 13 -15 - >16 == 0 -0.50669 0.06041 16 - >16 == 0 -0.10082 0.06668 16 - 13-15 == 0 0.40588 0.05888 --- -3.861 11.926 3.509 10.439 10.719 5.606 9.740 -8.387 -1.512 6.893 0.00101 < 0.001 0.00389 < 0.001 < 0.001 < 0.001 < 0.001 < 0.001 0.54057 < 0.001 ** *** ** *** *** *** *** *** *** they are all different. A confidence interval for these differences, the % change of the medians, is calculated by raising e to the confidence interval, and subtracting one from that and multiplying by 100. These are shown in the following figure: 173 Analysis Guide Midterm Figure 28.2.1. Tukey CIs on percent increase in the median Dunnett’s Test To compare to a control, dunnets test is the best! We do this with the following SAS code: lets look at the Code 28.2. DUnnett’s test proc glm data = LogEduData; class educ; model LogIncome = educ; lsmeans LogIncome / pdiff = ALL adjust=dunnett cl; run; and the following R code (and output!). 1 summary(dunnbett) #Dunnett 2 3 Simultaneous Tests for General Linear Hypotheses 4 5 Multiple Comparisons of Means: Dunnett Contrasts 6 7 8 Fit: aov(formula = logincome ~ Educ , data = prob3) 9 10 11 12 13 14 15 16 Linear Hypotheses: Estimate Std. Error t value Pr(>|t|) <12 - <<12 == 0 -0.32787 0.08493 >16 - <<12 == 0 0.67069 0.05624 13 -15 - <<12 == 0 0.16400 0.04674 16 - <<12 == 0 0.56987 0.05459 --- -3.861 0.000461 *** 11.926 < 1e-04 *** 3.509 0.001818 ** 10.439 < 1e-04 *** SAS output too! Figure 28.2.2. 
SAS p values We see that all of the groups are different from the control. We can calculate confidence intervals on 174 Analysis Guide Midterm how much percent different by raising e to the power of the CI, and then subtracting one and multiplying by 100, as seen in the next figure Figure 28.2.3. Dunnett CIs on percent increase in the median 175 Chapter 29 Unit 6 lecture slides lol 176 10/13/2018 Overview UNIT 6 Live Session Contrasts Multiple Comparison • ANOVA provides an F-test for equality of several means • The main weaknesses are • It doesn’t tell us which means are different • It doesn’t account for any structure in the groups (Example: Is the average treatment effect across 3 levels of treatments different from the placebo?) • The downside to this more refined analysis is that we need to control for the number of comparisons we end up making Example: Handicap & Capability Study Example: Handicap & Capability Study • Goal: How do physical handicaps affect perception of employment qualification? • Do subjects systematically evaluate qualifications differently according to handicap? • If so, which handicaps are evaluated differently? • (Cesare, Tannenbaum, and Dalessio “Interviewers’ decisions related to applicant handicap type and rater empathy” (1990) Human Performance) • The researchers prepared 5 video taped job interviews with same actors • The tapes differed only in the handicap of the applicant: • • • • • No handicap (This is the control group) One leg amputated Crutches Hearing Impaired Wheelchair • 14 students were randomly assigned to each tape to rate applicants: 0-10 pts (70 students total.) 1 10/13/2018 Is There Any Difference at All? • We should begin any analysis involving several groups by using the ANOVA framework • If there isn’t any (statistically) significant difference in the population means, then there is no reason to address more refined questions • The tapes differed only in the handicap of the applicant: • • • • • No handicap (This is the control group.) One leg amputated Crutches Hearing Impaired Wheelchair Handicap & Capability Study: Equal Variances Assumption Handicap & Capability Study: Normality Assumption There is NO visual evidence to suggest that the data are not normally distributed. We will proceed with the assumption of normally distributed groups. Handicap & Capability Study: ANOVA results There is evidence to support the claim that at least two population means are different from each other (p-value of 0.0301 from a 1-way ANOVA). There is NO evidence to suggest variances are unequal. Notice that since there is virtually no evidence of a difference in standard deviations, Welch’s test is almost identical to the pure F ANOVA. 2 10/13/2018 Handicap & Capability Study: More Specific Questions Linear Combinations & Contrasts (this requires independence) (CONTRAST) Handicap & Capability Study: A Contrast Handicap & Capability Study: A Contrast Calculate mean difference and standard error. There is evidence that the sum of points assigned to Amp & Hear handicaps is smaller than the sum of points assigned to Crutch & Wheel handicaps at level alpha equal to 0.05 because the CI does not contain 0. CI: Point estimate ± multiplier* standard error 3 10/13/2018 Chapter 6: Compare with book! Handicap & Capability Study: In SAS Order = data keeps the data in the order it came in, so that “none” group is first and can be assigned a coefficient of 0. Note the sign switch and division by 2 of the coefficients. 
Comes in handy when doing division by hand would result in the need to input a rounded number (example 0.33) Handicap & Capability Study: In SAS Handicap & Capability Study: In SAS Confidence Intervals There is evidence that the average points assigned to Amp & Hear handicaps is smaller than the average points assigned to Crutch & Wheel handicaps (t-tools linear contrast p-value of 0.0022). We estimate that this difference is -1.39 pts with an associated 99% confidence interval of…. 99% CI for the difference in averages of Amp and Hear vs. Crutch and Wheel: Point estimate ± multiplier* standard error -1.39±2.65*0.436 -1.39±1.155 Three different ways (contrast, estimate, estimate with divisor =2) to test for the same idea. (There are many more than three!) (-2.55, -0.23), which of course does not include 0 4 10/13/2018 Chapter 6 Let’s Try Some from Spock Example!! Groups: A, B, C, D, E, F, S With no Order = data in the code, the contrasts are assigned in alphabetical order, so that “none” group is fourth. Contrast vector (assume alphabetical order): Answer on Next Slide -> Let’s Try Some from Spock Example!! Let’s Try ANOTHER (from Spock)!! Groups: A, B, C, D, E, F, S Groups: A, B, C, D, E, F, S Contrast vector (assume alphabetical order): -1 -1 -1 -1 -1 -1 6 Contrast vector (assume alphabetical order): 5 10/13/2018 Let’s Try ANOTHER (from Spock)!! Groups: A, B, C, D, E, F, S Let’s Try ONE MORE (from Spock)!! Groups: A, B, C, D, E, F, S Contrast vector (assume alphabetical order): 1 1 1 -1 -1 -1 0 ADDITIONAL QUESTION: Why is it better to include the Spock data in the calculation of the pooled SD (and thus the MSE) even though the hypothesis does not include it? Let’s Try ONE MORE (from Spock)!! Contrast vector (assume alphabetical order): Answer on Next Slide -> Multiple Comparison: Motivation Groups: A, B, C, D, E, F, S K tests Contrast vector (assume alphabetical order): 3 0 3 -2 -2 -2 0 6 10/13/2018 Multiple Comparison: Example k = 100 Gene 1 Gene 5 Gene 9 Gene 2 Gene 6 Gene 10 Gene 98 Confidence Intervals Gene 97 Gene 3 Gene 7 Gene 11 … Gene 99 Gene 4 Gene 8 Gene 12 Gene 100 When we make a correction for multiple comparisons, it is the critical value in the hypothesis test and thus the multiplier in the confidence interval that is adjusted. *The multiplier is usually the same as the critical value for a hypothesis test. Planned & Post-hoc Tests A planned test is one in which you know the comparisons (tests) you want to make before you look at the data. If you have k planned comparisons then you need to correct for just those k comparisons. Post-Hoc / Unplanned Tests Post Hoc tests are appropriate when: 1. The researcher wants to examine all possible comparisons among pairs of group means (or a large number of comparisons). 2. Predictions about which groups will differ are not made prior to setting up the analysis. 7 10/13/2018 Multiple Comparison: Bonferroni Multiple Comparison: Tukey-Kramer Multiplier = For a set of Bonferroni adjusted t-tests, (α/k) we must have normal distributions, equal spreads, and independence (same as typical t-tests). However, the Bonferroni correction can be extended to tests that have no assumptions about distributions (e.g. rank sum test). For any set of independent parametric or non-parametric tests, the Bonferroni correction works the same. This approach is very conservative, meaning that the intervals are much wider than the nominal level, particularly if the tests are not really independent. 
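To make the α/k idea on the Bonferroni slide concrete, here is a minimal R sketch using base R's p.adjust; the raw p-values are made up purely for illustration and are not from any study in this guide.

# A minimal sketch of the Bonferroni correction with base R's p.adjust.
raw_p <- c(0.004, 0.020, 0.035, 0.300)   # k = 4 hypothetical tests
k     <- length(raw_p)
alpha <- 0.05

# Option 1: compare each raw p-value to alpha / k
raw_p < alpha / k

# Option 2 (equivalent): inflate the p-values and compare to alpha
p.adjust(raw_p, method = "bonferroni")           # each p multiplied by k, capped at 1
p.adjust(raw_p, method = "bonferroni") < alpha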
The Tukey-Kramer adjustment is a modification to this test to account for different sample sizes in the groups. Assumes normal distributions, equal spreads, independence (same as typical t-tests), and equal group sample sizes. More consistent than Bonferroni with respect to Type I Error but not robust to its assumptions…. Bonferroni is a good alternative when the assumptions are violated. Multiple Comparison: Dunnett Many Groups to one Control … Studentized Range Statistic Table Handicap / Capability Study: Data Assumes normal distributions, equal spreads, and independence (same as typical t-tests). Replaces t-distribution with a multivariate tdistribution (n=# of groups versus control), where the tests are not independent. 8 10/13/2018 Handicap Data Analysis First Test!!! Questions of Interest: 1. Is there any evidence that at least one pair of mean qualification scores are different from each other? 2. Let’s say we are only interested in Amputee versus None. Test the claim the Amputee has a different mean score than the None group. 3. Now let’s assume that we are interested in identifying specific differences between any two of the group means. Find evidence of any differences in the means between the groups. 4. Next, assume that we were interested in testing the means of the handicapped groups to the non-handicap group. Test this claim and identify any significant differences. Normality: Handicap Data There is no visual evidence to suggest that the data are not normally distributed. We will proceed with the assumption of normally distributed groups. Homogeneity of SD Assumption There is no evidence to suggest variances are unequal. Independence may be violated here. We are going to proceed anyway for the sake of the example. 9 10/13/2018 First QOI!!! Second QOI!!! 2. Let’s say we are only interested in Amputee versus None. Test the claim the Amputee has a different mean score than the None group. 1. Is there any evidence that at least one pair of mean qualification scores are different from each other? There is sufficient evidence to suggest at the alpha = .05 level of significance (p-value = .0301) that at least 2 of the means are different from each other in this standard ANOVA. Second QOI: Better approach!!! 2. Let’s say we are only interested in Amputee versus None. Test the claim the Amputee has a different mean score than the None group. There is not sufficient evidence to suggest that the mean qualification rating of the amputee group is different than the group with no handicap (p-value = .4477 from a contrast using all available data). Even though the p-values for the two tests are only slightly different, it is better to use all available data (the procedure on the right). Comparing a pair of means can be just a simple contrast. The results of these tests are equivalent! There is not sufficient evidence to suggest that the mean qualification rating of the amputee group is different than the group without handicap. (P-value = .4678 from a t-test and an ANOVA using only these two groups.) Third QOI!!! Now let’s assume that we are interested in identifying specific differences between any two group means. Find evidence of any differences in the means between the groups. There are 10 different two sided tests conducted here; thus, we need to adjust alpha per test to be .05/10 = .005. With this adjustment, only one of the tests has a statistically significant result. 
Therefore, there is evidence (p-value = .0035 from a t-test) that the crutches and hearing groups have different mean qualification rating scores. We will provide a confidence interval in a few slides. 10 10/13/2018 Third QOI!!! Bonferroni Adjusted P-Values P-values not adjusted- compare to individual alpha Now let’s assume that we are interested in identifying specific differences between any two group means. Find evidence of any differences in the means between the groups. P-values adjusted- compare to familywise alpha A 95% confidence interval for the difference in means of the crutches and hearing groups is (.0779, 3.66499). Compare to alpha = 0.05 Compare to alpha = 0.005 x 10, up to 1 Third QOI!!! Now let’s assume that we are interested in identifying specific differences between any two group means. Find evidence of any differences in the means between the groups. A 95% confidence interval for the difference in means of crutches and hearing groups is (.0779, 3.66499). *Slightly different code from the last slide, producing slightly different output. Note the cl versus cldiff. 4th QOI: Next, assume that we are interested in testing the means of the handicapped groups with the non-handicapped group. Test this claim and identify any significant differences. (Using CIs) There is NOT sufficient evidence in this study to suggest that there are any differences between the average of the means of each handicap group and the mean of the group without handicap. The 95% family-wise confidence intervals are constructed using Dunnett’s procedure. All CIs contain zero, thus not providing sufficient evidence to conclude that the difference is not zero. (The study results do not constitute sufficient evidence to support the claim that any means tested are individually different than the control.) Specify the control group 11 10/13/2018 4th QOI: Next, assume that we were interested in testing the means of the handicapped groups with the non-handicap group. Test this claim and identify any significant differences. (Using HTs) R Code for Handicap Example Question 1 Question 1: Reading in Data and ANOVA Hypothesis tests also conclude that there is not sufficient evidence to suggest that there are any differences between the means of each handicapped group and the mean of the of the group without handicap. The above Dunnett adjusted p-values are all greater than alpha = .05, as is visible from the table above. R Code for Handicap Example Question 2 R Code for Handicap Example Question 3 Note: Must Load multcomp package Note: Must Load pairwiseCI package Note: Must Load multcomp package 12 10/13/2018 R Code for Handicap Example Question 4 Appendix Note: Must Load multcomp package Bonferroni’s Correction Bonferroni’s Correction 13 10/13/2018 Bonferroni’s Correction Multivariate distribution • A multivariate distribution is distribution of a vector of conditional random variables. • Bivariate normal distribution can easily be shown graphically. 14 Part VII Workflow for testing hypotheses 191 CHOOSING A HYPOTHESIS TEST RESEARCH STRUCTURE NORMAL DISTRIBUTION SAMPLE SIZE VARIANCE DATA TRANSFORMATION MULTIPLE HYPOTHESIS TEST NO ONE SAMPLE Difference between mean of independent samples and a hypothesized mean Single measure or observation parametric ONE-SAMPLE T-TEST Inference on means (medians if log-transform) YES (CLT) NO (w/LOG TRANSFORMATION)* EVIDENCE AGAINST NORMALITY? 
MATCHED PAIRS: difference between the same group before and after treatment (within-groups); repeated measures or observations.
- Parametric (no evidence against normality, or a sufficient sample size for the CLT, possibly with a log transformation*): a t test on the paired differences; inference on means (medians if log-transformed).
- Nonparametric (evidence against normality and insufficient sample size): SIGN TEST or WILCOXON SIGNED RANK TEST; inference on medians.

UNPAIRED TESTING (TWO SAMPLES): difference between independent groups (between-groups); single measure or observation. Decision points: evidence against normality? sufficient sample size (CLT)? same sample sizes? evidence against equal standard deviations?
- Parametric, no evidence against unequal standard deviations: POOLED TWO-SAMPLE T; inference on means (medians if log-transformed*).
- Parametric, evidence against equal standard deviations (particularly when the sample sizes differ): WELCH'S T; inference on means.
- Nonparametric (evidence against normality and insufficient sample size): WILCOXON RANK SUM (aka Mann-Whitney U test); inference on medians.

UNPAIRED TESTING (MORE THAN TWO SAMPLES): difference between independent groups (between-groups); single measure or observation.
- Parametric, no evidence against unequal standard deviations: ONE-WAY ANOVA; inference on means (medians if log-transformed*).
- Parametric, evidence against equal standard deviations: WELCH'S ANOVA; inference on means.
- Nonparametric (evidence against normality and insufficient sample size): KRUSKAL-WALLIS; inference on medians.

POST HOC TESTS:
- TUKEY-KRAMER (aka TUKEY'S HSD)
- DUNNETT: for comparison to a control group
- BONFERRONI CORRECTION: distribution-free, more conservative, wider intervals
- REGWQ: lower Type II error rate than either Bonferroni or Tukey-Kramer

HYPOTHESIS TESTING STEP-BY-STEP
1. Read the problem carefully. Is it a randomized experiment or an observational study?
2. Plot the data using histograms, box plots, or Q-Q plots.
3. Determine which test to use. Do the data satisfy the test's assumptions?
4. State the null and alternative hypotheses. Is this a one-sided or two-sided test?
5. Select a test statistic and confidence level (1-α). Find the critical value.
6. Sketch the distribution, including the critical value and the acceptance and/or rejection region(s).
7. Compute the test statistic and the probability (p-value) of obtaining the observed results if the null hypothesis is true.
8. Reject or fail to reject the null hypothesis. (Never accept the null hypothesis.)
9. Perform post hoc testing, if applicable, to determine which groups are different.
10. State the statistical conclusion in the context of the original problem.

* Tests using log-transformed data give inference on medians.
Rev. 5 (6/25/2015) Michael Burkhardt • mburkhardt@smu.edu

Note that the nonparametric tests give inference on medians, and Kruskal-Wallis is the nonparametric analog of ANOVA.
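As a practical companion to the flow chart, the following is a minimal R sketch of the more-than-two-samples branch; the data frame df, its response and group columns, and the simulated values are hypothetical placeholders, not data from this guide.

# Sketch of the "more than two samples" branch: check assumptions, then pick a test.
set.seed(1)
df <- data.frame(
  response = c(rnorm(20, 10, 2), rnorm(20, 12, 2), rnorm(20, 15, 2)),
  group    = factor(rep(c("A", "B", "C"), each = 20))
)

# Informal assumption checks (pair these with histograms and Q-Q plots).
by(df$response, df$group, shapiro.test)          # normality within each group
# car::leveneTest(response ~ group, data = df)   # equal-spread check, if car is installed

# Equal spreads and roughly normal data (or large n): one-way ANOVA.
fit <- aov(response ~ group, data = df)
summary(fit)
TukeyHSD(fit)                                    # post hoc pairwise comparisons

# Unequal spreads but roughly normal data: Welch's ANOVA.
oneway.test(response ~ group, data = df)         # var.equal = FALSE by default

# Evidence against normality with small samples: Kruskal-Wallis.
kruskal.test(response ~ group, data = df)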