Math 445—One-Factor Analysis of Variance (ANOVA)

 

Suppose we have I distinct normal populations with respective means $\mu_1, \mu_2, \ldots, \mu_I$, but the same variance, $\sigma^2$. Furthermore, suppose we have independent random samples of size J from each of these populations. We want to compare the means of these populations: are they the same, or are some (or all) of the means different?

 

These conditions are most reasonable in a one-factor, randomized experiment, where experimental units are randomly assigned to one of I treatments. Unlike the general two-sample problem, in this situation it’s reasonable to assume the treatments affect only the mean, not the variability, so we can assume $\sigma^2$ stays constant.

 

Example

Suppose a clinical trial is run, where there are 3 treatments (placebo, drug A, drug B) and the response variable is the decrease in blood pressure. It seems reasonable to think the treatments might affect the average decrease in blood pressure, but won’t change the variability. We want to know if the average decrease in blood pressure is the same for each treatment group (that is, the drugs aren’t any more effective, on average, than the placebo), or if there is a difference in the average blood-pressure decrease, depending on the treatment.

 

Visual Example

Before we develop the appropriate statistical theory, consider a visual example. Included below are boxplots showing the results of two different completely-randomized experiments (each with 4 treatments).

 

It seems clear that the second experiment shows treatment means that are significantly different from each other (whereas the first experiment does not). Why? Because the variability between the treatments in experiment 2 is so much larger than the variation within the treatments. As part of our analysis, we will specifically compare these two types of variation.

 

Notation, Definitions, Development of Sums-of-Squares, and Determination of Distributions

See your class notes. (All this good stuff will be worked out—on the white board—in class.)

 

One-factor/One-way Analysis-Of-Variance (ANOVA) Model

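We can write the ANOVA conditions in a “model”-based way:

$X_{ij} = \mu_i + \epsilon_{ij}$ (for $i = 1, \ldots, I$ and $j = 1, \ldots, J$), where the $\epsilon_{ij}$ are independent $N(0, \sigma^2)$ random variables

(note: this implies that the $X_{ij}$ are independent $N(\mu_i, \sigma^2)$ random variables).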
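One way to get a feel for this model is to simulate from it. The following is a minimal sketch using only the Python standard library; the function name `simulate_one_factor` and the treatment means, standard deviation, and sample size are made up for illustration and do not come from any real experiment.

```python
import random
import statistics

def simulate_one_factor(mus, sigma, J, seed=0):
    """Draw J observations per treatment from X_ij = mu_i + eps_ij,
    where the eps_ij are independent N(0, sigma^2) errors."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [[mu + rng.gauss(0.0, sigma) for _ in range(J)] for mu in mus]

# Hypothetical settings (I = 3 treatments, common sigma, J = 8 per treatment).
samples = simulate_one_factor(mus=[10.0, 12.0, 15.0], sigma=2.0, J=8)

for i, group in enumerate(samples, start=1):
    print(f"treatment {i}: mean = {statistics.mean(group):.2f}, "
          f"sd = {statistics.stdev(group):.2f}")
```

Each simulated group is centered near its own mean, while the spread is governed by the single shared value of sigma.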

 

Then to test $H_0: \mu_1 = \mu_2 = \cdots = \mu_I$ against $H_a$: at least two of the $\mu_i$ are different:

·         Check conditions (i.e., normality and equal variances) of the procedure—are they plausible for this particular data set?

 

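·         Calculate the test statistic, $F = \mathrm{MSTr}/\mathrm{MSE}$ (the mean square for treatments divided by the mean square for error).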

 

·         Determine the p-value using the appropriate F distribution: p-value $= P(F_{I-1,\,I(J-1)} \ge f)$, where f is the observed value of the test statistic (recall we only reject for large values of the test statistic).

 

·         If a statistically significant difference in means is found, conduct appropriate multiple comparisons to see where the specific significant differences lie (more on this later). Also consider the practical significance.

 

·         Provide a conclusion in the words/context of the particular problem.
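The calculation step in the procedure above can be sketched in a few lines of Python. This is only an illustration with made-up numbers, not a substitute for Minitab; the function name `f_statistic` is hypothetical, and the p-value lookup in an F distribution is left to software.

```python
from statistics import mean

def f_statistic(groups):
    """Return (F, df_treatment, df_error) for a one-factor layout.

    `groups` is a list of lists, one inner list per treatment.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]

    # Between-treatment sum of squares: how far the group means sit from the grand mean.
    ss_treatment = sum(len(g) * (m - grand_mean) ** 2
                       for g, m in zip(groups, group_means))
    # Within-treatment (error) sum of squares: spread of the data around each group mean.
    ss_error = sum((x - m) ** 2
                   for g, m in zip(groups, group_means) for x in g)

    df_treatment = len(groups) - 1
    df_error = len(all_values) - len(groups)
    f = (ss_treatment / df_treatment) / (ss_error / df_error)
    return f, df_treatment, df_error

toy = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]  # made-up data, I = 3 and J = 3
print(f_statistic(toy))  # prints (3.0, 2, 6)
```

For the toy data both sums of squares equal 6, so the mean squares are 3 and 1 and the ratio is 3.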

Important Remarks to Accompany the One-Factor ANOVA Methods

·         As a rule-of-thumb, the equal-variances condition has been met (i.e., seems plausible) if the largest sample standard deviation is no more than twice the smallest sample standard deviation. (The textbook mentions the Levene test, but you can simply use this rule-of-thumb.) If this condition is not met, then we can attempt to transform the data (natural log or square root transformations often work well). Then the ANOVA can be re-run on the transformed data.

 

·         The normality condition should also be validated. This is difficult to check for the individual groups (since J might be small). Note that in our model statement, the normality condition is actually on the error terms $\epsilon_{ij}$. We can estimate these errors from our data: $X_{ij} = \bar{X}_{i\cdot} + (X_{ij} - \bar{X}_{i\cdot})$ (this has the form Data = Fit + Residual). The estimated errors, $e_{ij} = X_{ij} - \bar{X}_{i\cdot}$, are called the residuals. We can look at graphs (e.g., histogram, normal-probability plot) of the residuals in order to assess the normality condition.

 

·         When the alternative hypothesis is true (i.e., at least two of the $\mu_i$ are different), the test statistic has what is called a non-central F distribution (with an additional “non-centrality” parameter). Hence, power calculations are complicated, and we’ll rely on Minitab for these.

 

·         Thus far, we have assumed equal sample sizes for each treatment (this makes the notation cleaner and the derivations less grungy). This is called a balanced design and is desirable, but not necessary. ANOVA can still be applied to unbalanced designs, where the sample sizes $J_1, J_2, \ldots, J_I$ are not all equal (we’ll let Minitab deal with the grunge).

 

·         This ANOVA model is a “fixed effects” model—that is, the levels of the factor in the experiment are the only ones considered relevant. In some situations, there might be very many relevant factor levels of which only a subset can be tested. When the tested levels are chosen at random from the relevant population of all levels, then a different model is needed—a “random effects” model. In our class, we’ll focus on fixed-effects models. (But realize you can read the book to learn more about random-effects models.)
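Two of the checks described in the remarks above, the standard-deviation rule-of-thumb and the computation of residuals, are easy to sketch in Python. The helper names `sd_ratio_ok` and `residuals` and the data are hypothetical, made up only to show the mechanics.

```python
from statistics import mean, stdev

def sd_ratio_ok(sds, limit=2.0):
    """Rule of thumb: largest sample SD is no more than `limit` times the smallest."""
    return max(sds) <= limit * min(sds)

def residuals(groups):
    """Residual = Data - Fit, where the fit for each observation is its group mean."""
    return [[x - mean(g) for x in g] for g in groups]

toy = [[4.1, 5.0, 5.9], [6.8, 7.1, 8.6], [9.0, 9.9, 11.1]]  # made-up data

print("equal-variance rule of thumb met:", sd_ratio_ok([stdev(g) for g in toy]))

# Pool the residuals across groups; these pooled values are what one would
# plot in a histogram or normal-probability plot.
pooled = [e for g in residuals(toy) for e in g]
print("pooled residuals:", [round(e, 2) for e in pooled])
```

Pooling gives I times J residuals to look at, rather than only J values per group, which is why the normality check is done on the residuals.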

 

Total Sum-of-Squares and General ANOVA Table

See your class notes. (All this good stuff will be worked out—on the white board—in class.)

 

One-Factor ANOVA Example

Many studies have suggested that there is a link between exercise and healthy bones. Exercise stresses the bones and this causes them to get stronger. One study (done at Purdue University) examined the effect of jumping on the bone density of growing rats. There were three treatments: a control with no jumping, a low-jump condition (the jump height was 30 cm), and a high-jump condition (60 cm). After 8 weeks of 10 jumps per day, 5 days per week, the bone density of the rats (expressed in mg/cm³) was measured. (Isn’t it fun to think of rats jumping?) Comparative boxplots and separate numerical summaries for the treatment groups are shown below.

 

 

Variable               Treatment        N     Mean   StDev   Minimum       Q1   Median       Q3   Maximum

Bone Density (mg/cm3)  1 - Control     10   601.10   27.36    554.00   587.00   601.50   615.75    653.00

                       2 - Low Jump    10   612.50   19.33    588.00   595.50   606.00   632.75    638.00

                       3 - High Jump   10   638.70   16.59    622.00   625.00   637.00   650.00    674.00

 

As usual, the first step in any statistical analysis is to summarize the variables graphically and numerically. This first descriptive look allows you, among other things, to see if there are any issues with the data (for example, outliers or mis-recorded observations).

 

We want to test $H_0: \mu_1 = \mu_2 = \mu_3$ against $H_a$: at least two of the means are different. First we’ll assess the equal-variances condition of the ANOVA test. From the boxplots, the variability in bone densities looks somewhat similar for the three treatment groups (not exactly the same, but not wildly different). Using our rule-of-thumb, the largest standard deviation (27.36 mg/cm³) is less than twice the smallest standard deviation (16.59 mg/cm³), so the equal-variance condition seems plausible for these data.

 

From Minitab we get the following ANOVA table, as well as graphs (histogram and normal probability plot) of the residuals.

 

One-way ANOVA: Bone Density (mg/cm3) versus Treatment

Source     DF     SS    MS     F      P

Treatment   2   7434  3717  7.98  0.002

Error      27  12580   466

Total      29  20013
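As a rough cross-check, the entries in this table can be rebuilt from the group sizes, means, and standard deviations in the numerical summaries. The sketch below uses only the Python standard library. Because the summaries are rounded, the error sum of squares comes out slightly different from Minitab's 12580. The p-value line relies on a closed form that holds only when the numerator degrees of freedom equal 2 (that is, I = 3); it is a shortcut for this example, not a general method.

```python
from statistics import mean

# Summary statistics copied from the numerical summaries (rounded values).
J = 10
means = [601.10, 612.50, 638.70]   # control, low jump, high jump
sds = [27.36, 19.33, 16.59]

I = len(means)
grand_mean = mean(means)           # valid because every group has the same size J

ss_treatment = J * sum((m - grand_mean) ** 2 for m in means)
ss_error = (J - 1) * sum(s ** 2 for s in sds)
df_treatment, df_error = I - 1, I * (J - 1)

ms_treatment = ss_treatment / df_treatment
ms_error = ss_error / df_error
f = ms_treatment / ms_error

# Upper-tail probability for an F distribution with 2 numerator df:
# P(F_{2,v} >= f) = (1 + 2f/v)^(-v/2). Not valid for other numerator df.
p_value = (1 + 2 * f / df_error) ** (-df_error / 2)

print(f"SSTr = {ss_treatment:.0f} on {df_treatment} df, "
      f"SSE = {ss_error:.0f} on {df_error} df")
print(f"F = {f:.2f}, p-value = {p_value:.3f}")
```

The F statistic and p-value agree with the Minitab output to the number of digits shown there.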

 

 

 

Recall the residual plots allow us to assess the normality condition (since checking the normality separately for each treatment group can be difficult if there are few observations). The residuals seem to roughly follow a normal distribution, so this condition seems plausible for these data.

 

Note the P-value for our test is 0.002. That is, if the average bone density were the same for all three treatments, there would be only a 0.002 chance of getting an F statistic as large as, or larger than, the one we observed. This gives strong evidence that there is a difference in average bone density for at least two of the treatment groups.

 

This raises the question: between which treatment groups is there a significant difference in average bone density?

 

Multiple Comparisons

If the ANOVA null hypothesis of equal population means cannot be rejected, then the analysis stops. If there is evidence of some difference in population means, then the natural next step is to find where exactly the differences are. There is, though, an important issue to consider when doing multiple tests on the same set of data.

 

Each individual test has a Type I error rate, $\alpha$. But what is the overall or “family” error rate? For example, if two independent tests are performed at the 0.05 level, then P(no error in either test) = (.95)(.95) = 0.9025, so the overall error rate is actually 1 – 0.9025 = 0.0975, not 0.05.
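The arithmetic above extends to any number of independent tests. Here is a small sketch; the function name `family_error_rate` is made up, and the calculation assumes the tests are independent, which is an idealization.

```python
def family_error_rate(alpha, m):
    """Chance of at least one Type I error among m INDEPENDENT tests,
    each run at level alpha, when every null hypothesis is true."""
    return 1 - (1 - alpha) ** m

for m in (1, 2, 3, 10):
    print(f"{m:2d} tests at 0.05 each -> family error rate "
          f"{family_error_rate(0.05, m):.4f}")
```

The rate grows quickly with the number of tests, which is the motivation for adjusting the individual error rates.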

 

The family Type I error rate is the chance of at least one false alarm somewhere among the set of tests, when all underlying population averages are equal. When testing on the same data, the tests are not independent, so we can’t use the same simplistic argument as above. But we can apply Boole’s/Bonferroni’s inequality (space below for you to write in our work):

 

 

 

 


 

The Bonferroni method for multiple comparisons tests each pair of means using an $\alpha/m$ error rate, where m is the total number of tests and $\alpha$ is the desired overall/family error rate. This method is easy to apply, but is conservative (i.e., the overall error rate might actually be far smaller than we think it is). Note: Our textbook doesn’t mention this method.
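For reference, the bound behind this method can be written out as follows. This is a sketch of the standard argument, with $A_k$ introduced here (it is not notation from the handout) to denote the event that test k produces a Type I error when each test is run at level $\alpha/m$.

```latex
\[
P(\text{at least one Type I error})
  = P\!\left(\bigcup_{k=1}^{m} A_k\right)
  \le \sum_{k=1}^{m} P(A_k)
  = m \cdot \frac{\alpha}{m}
  = \alpha .
\]
```

No independence is needed for the inequality, which is why it applies to tests run on the same data. When all pairs of I means are compared, $m = I(I-1)/2$; with I = 3 treatments and a family rate of 0.05, each of the m = 3 tests would be run at level 0.05/3, or about 0.0167.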

 

There are many methods of adjusting for multiple comparisons. The most-commonly used method (when a computer is available for the analysis) is one developed by John Tukey. His method (shown in our textbook) utilizes a distribution called the studentized-range distribution. You don’t need to know all the details, but here’s the gist of Tukey’s method:

For each pair $i < j$, form the interval $(\bar{X}_{i\cdot} - \bar{X}_{j\cdot}) \pm Q_{\alpha,\,I,\,I(J-1)} \sqrt{\mathrm{MSE}/J}$, where $Q_{\alpha,\,I,\,I(J-1)}$ is a multiplier that takes into account $\alpha$ as the family error rate (Minitab will do this for us). If an interval does not include 0, then there is a significant difference in those means. (Note: If the family confidence level is 95%, then each of the individual confidence levels will be higher than 95%.)

 

Important Remarks about Multiple Comparisons

·         In general, statisticians typically rate “not adjusting for multiple comparisons on the same set of data” as a grave error in methodology that can give misleading results.

 

·         The Bonferroni method is a quick adjustment that can be done easily “by hand,” but, especially for large numbers of comparisons, this method is very conservative.

 

·         Tukey’s method is typically less conservative than Bonferroni, yet it still adjusts for multiple comparisons. This method is used most often in practice. Minitab makes it easy to get Tukey’s intervals (even for unequal sample sizes).

 

·         If you are interested more descriptively than inferentially in the data, and simply want to see where there might be significant differences in order to set up your next experiment, then you can ignore the multiple-comparison issue and do each individual test at, say, the 0.05 level. But you must think of your p-values descriptively: they only give you information on how to focus your next experiment (this is especially helpful if you have a huge data set with a large number of comparisons). The pitfall of many researchers is that they want results right away, so they initially think they’ll investigate descriptively (to set up the next experiment), but then they get caught up in what they find and want to publish it right away. Please don’t be that impatient researcher!

 

 

 

Back to the Rat-Jumping Example

Since we found a significant difference between at least two means, it’s appropriate to now compare the means pairwise. Included below is the Minitab output for Tukey’s method (which adjusts for multiple comparisons, and has a family error rate of 5%) and for Fisher’s method (which doesn’t adjust for multiple comparisons). Not adjusting for multiple comparisons (Fisher’s method) is appropriate only if you’re simply exploring the data looking for interesting effects to investigate in another experiment. Notice that, in this case, regardless of whether we adjust for multiple comparisons, we see a significant difference in average bone density between the control group and the high-jump group and between the low-jump group and the high-jump group (but not between the control group and the low-jump group).

 

 


 

Tukey 95% Simultaneous Confidence Intervals

All Pairwise Comparisons among Levels of Treatment

 

Individual confidence level = 98.04%

 

Treatment = 1-Control subtracted from:

 

Treatment       Lower  Center  Upper  -------+---------+---------+---------+--

2-Low Jump     -12.56   11.40  35.36               (-------*-------)

3-High Jump     13.64   37.60  61.56                        (-------*-------)

                                      -------+---------+---------+---------+--

                                           -30         0        30        60

 

 

Treatment = 2-Low Jump subtracted from:

 

Treatment      Lower  Center  Upper  -------+---------+---------+---------+--

3-High Jump     2.24   26.20  50.16                    (-------*-------)

                                     -------+---------+---------+---------+--

                                          -30         0        30        60

 

 

 

 

Fisher 95% Individual Confidence Intervals

All Pairwise Comparisons among Levels of Treatment

 

Simultaneous confidence level = 88.07%

 

 

Treatment = 1-Control subtracted from:

 

Treatment      Lower  Center  Upper  -----+---------+---------+---------+----

2-Low Jump     -8.41   11.40  31.21              (------*-----)

3-High Jump    17.79   37.60  57.41                       (------*-----)

                                     -----+---------+---------+---------+----

                                        -30         0        30        60

 

 

Treatment = 2-Low Jump subtracted from:

 

Treatment      Lower  Center  Upper  -----+---------+---------+---------+----

3-High Jump     6.39   26.20  46.01                   (------*-----)

                                     -----+---------+---------+---------+----

                                        -30         0        30        60
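The intervals in both outputs can be approximately rebuilt from the group means, the error mean square of 466, and J = 10. The sketch below takes each half-width to be a critical value times a standard error: a studentized-range multiplier times sqrt(MSE/J) for Tukey, and a t multiplier times sqrt(2·MSE/J) for Fisher. The two critical values are approximate table values supplied by hand; they are assumptions of this sketch and are not printed anywhere in the Minitab output.

```python
from itertools import combinations
from math import sqrt

means = {"1-Control": 601.10, "2-Low Jump": 612.50, "3-High Jump": 638.70}
MSE, J = 466.0, 10   # error mean square from the ANOVA table; rats per group

# Assumed (approximate) table values, not taken from the handout:
Q_TUKEY = 3.51       # studentized range, family alpha = 0.05, 3 groups, 27 df
T_FISHER = 2.052     # t distribution, upper 0.025 tail, 27 df

def pairwise_intervals(half_width):
    """Interval for (mean of b) - (mean of a), for every pair a before b."""
    out = {}
    for a, b in combinations(means, 2):
        center = means[b] - means[a]
        out[(a, b)] = (center - half_width, center, center + half_width)
    return out

tukey = pairwise_intervals(Q_TUKEY * sqrt(MSE / J))
fisher = pairwise_intervals(T_FISHER * sqrt(2 * MSE / J))

for name, table in (("Tukey", tukey), ("Fisher", fisher)):
    for (a, b), (lo, center, hi) in table.items():
        verdict = "significant" if lo > 0 or hi < 0 else "not significant"
        print(f"{name:6s} {b} minus {a}: ({lo:7.2f}, {hi:6.2f})  {verdict}")
```

The Tukey intervals are wider than the Fisher intervals because the Tukey multiplier pays for the family-wide guarantee, yet in this example both methods flag the same two pairs.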