Math 445: One-Factor Analysis of Variance (ANOVA)
Suppose we have I distinct normal populations with respective means
$\mu_1, \mu_2, \ldots, \mu_I$, but the same variance, $\sigma^2$. Furthermore, suppose we
have independent random samples of size J from each of these populations. We
want to compare the means of these populations—are they the same, or are some
(or all) of the means different?
These conditions are most reasonable in a
one-factor, randomized experiment, where experimental units are randomly
assigned to one of I treatments. Unlike the general two-sample problem, in this
situation it’s reasonable to assume the treatments do not affect the
variability (just the mean), so we can assume $\sigma^2$
stays constant.
Example
Suppose
a clinical trial is run, where there are 3 treatments (placebo, drug A, drug B)
and the response variable is the decrease in blood pressure. It seems
reasonable to think the treatments might affect the average blood pressure, but
won’t change the variability in blood pressures. We want to know if the average
blood pressure for each treatment group is the same (that is, the drugs aren’t
any more effective, on average, than the placebo), or if there is a difference
in the average blood-pressure decrease, depending on the treatment.
Visual Example
Before we
develop the appropriate statistical theory, consider a visual example. Included
below are boxplots showing the results of two
different completely-randomized experiments (each with 4 treatments).


It
seems clear that the second experiment shows treatment means that are
significantly different from each other (whereas the first experiment does
not). Why? Because the variability
between the treatments in experiment 2 is so much larger than the variation
within the treatments. As part of our analysis, we will specifically
compare these two types of variation.
Notation, Definitions, Development of Sums-of-Squares, and Determination of Distributions
See your class notes. (All this good stuff
will be worked out—on the white board—in class.)
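One-factor/One-way Analysis-Of-Variance (ANOVA) Model
We can write the ANOVA conditions in a “model”-based way:
$X_{ij} = \mu_i + \epsilon_{ij}$ (for $i = 1, \ldots, I$ and $j = 1, \ldots, J$), where the
$\epsilon_{ij}$ are independent $N(0, \sigma^2)$ random variables
(note: this implies that the $X_{ij}$ are independent
$N(\mu_i, \sigma^2)$ random variables).
Then to test $H_0: \mu_1 = \mu_2 = \cdots = \mu_I$
against $H_a$ (at least two of the $\mu_i$ are different):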
· Check conditions (i.e., normality and equal variances) of the procedure: are
they plausible for this particular data set?
· Calculate the test statistic, $F = \mathrm{MSTr}/\mathrm{MSE}$ (the mean square for
treatments divided by the mean square for error).
· Determine the p-value using the appropriate F distribution: p-value
$= P(F_{I-1,\,I(J-1)} \ge f)$, where f is the observed value of the test statistic
(recall we only reject for large values of the test statistic).
· If a statistically significant difference in means is found, conduct appropriate
multiple comparisons to see where the specific significant differences lie
(more on this later). Also consider the practical significance.
· Provide a conclusion in the words/context of the particular problem.
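The calculation steps above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the Minitab procedure used in class: the function names (`one_way_anova`, `f_upper_tail`) and the three small samples are made up for the example, and the p-value is approximated by numerically integrating the F distribution's tail (via the incomplete beta integral) rather than looked up in a table.

```python
import math


def one_way_anova(groups):
    """Return (f, df_treatment, df_error) for a list of samples (lists of numbers)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-treatment variation: group means around the grand mean.
    ss_treatment = sum(len(g) * (m - grand_mean) ** 2
                       for g, m in zip(groups, group_means))
    # Within-treatment variation: observations around their own group mean.
    ss_error = sum((x - m) ** 2
                   for g, m in zip(groups, group_means) for x in g)
    df_treatment = len(groups) - 1
    df_error = n_total - len(groups)
    f = (ss_treatment / df_treatment) / (ss_error / df_error)
    return f, df_treatment, df_error


def f_upper_tail(f, df1, df2, steps=2000):
    """Approximate P(F >= f) for an F(df1, df2) variable using Simpson's rule.

    Assumes df2 >= 2 and an even number of steps.
    """
    if f <= 0:
        return 1.0
    a, b = df2 / 2.0, df1 / 2.0
    x = df2 / (df2 + df1 * f)  # upper limit of the incomplete beta integral
    h = x / steps

    def integrand(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)

    total = integrand(0.0) + integrand(x)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (total * h / 3) / math.exp(log_beta)


# Hypothetical data: I = 3 treatments, J = 4 observations each.
samples = [[10, 12, 11, 13], [14, 15, 13, 16], [20, 19, 21, 22]]
f_stat, df1, df2 = one_way_anova(samples)
p_value = f_upper_tail(f_stat, df1, df2)
print(f"F = {f_stat:.2f} on ({df1}, {df2}) df, p-value = {p_value:.2g}")
```

A large F (between-treatment variation much bigger than within-treatment variation) gives a small p-value, which matches the "reject only for large values" remark in the steps above.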
Important Remarks to Accompany the One-Factor ANOVA Methods
· As a rule-of-thumb, the equal-variances condition has been met (i.e., seems plausible) if the largest
sample standard deviation is no more than twice the smallest sample standard
deviation. (The textbook mentions the Levene test,
but you can simply use this rule-of-thumb.) If this condition is not met, then we
can attempt to transform the data (often a natural log or square root
transformation works well). Then the ANOVA can be re-run on the transformed
data.
· The normality condition should also be validated. This is difficult to check for
the individual groups (since J might be small). Note that in our model
statement, the normality condition is actually on the error terms
$\epsilon_{ij}$. We can estimate these errors from our data:
$x_{ij} = \bar{x}_{i\cdot} + (x_{ij} - \bar{x}_{i\cdot})$
(this has the form Data = Fit + Residual). The estimated
errors, $\hat{\epsilon}_{ij} = x_{ij} - \bar{x}_{i\cdot}$, are called the
residuals. We can look at graphs (e.g.,
histogram, normal-probability plot) of the residuals in order to assess the
normality condition.
· When the alternative hypothesis is true (i.e.,
at least two of the $\mu_i$ are different), the
test statistic has what is called a non-central F distribution (with an additional
“non-centrality” parameter). Hence, power calculations are complicated, so we’ll
rely on Minitab for these.
· Thus far, we have assumed equal sample sizes for each treatment (this makes the
notation cleaner and the derivations less grungy). This is called a balanced
design and is desirable, but not necessary. ANOVA can still be applied to
unbalanced designs, where the sample sizes $J_1, J_2, \ldots, J_I$ are not all equal
(we’ll let Minitab deal with the grunge).
· This ANOVA model is a “fixed effects” model; that is, the levels of the factor in the
experiment are the only ones considered relevant. In some situations, there
might be very many relevant factor levels of which only a subset can be tested.
When the tested levels are chosen at random from the relevant population of all
levels, then a different model is needed: a “random effects” model. In our
class, we’ll focus on fixed-effects models. (But realize you can read the book
to learn more about random-effects models.)
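The first two remarks (the standard-deviation rule-of-thumb and Data = Fit + Residual) can be illustrated with a short Python sketch. The function names and the tiny data set here are hypothetical, chosen only for illustration; in class the residual plots come from Minitab.

```python
import statistics


def equal_variance_rule_of_thumb(sample_sds):
    """True if the largest sample SD is no more than twice the smallest."""
    return max(sample_sds) <= 2 * min(sample_sds)


def residuals(groups):
    """Residual = observation minus its own group mean (the 'fit')."""
    result = []
    for g in groups:
        fit = statistics.mean(g)
        result.append([x - fit for x in g])
    return result


# Hypothetical data: 3 treatments with 4 observations each.
data = [[10, 12, 11, 13], [14, 15, 13, 16], [20, 19, 21, 22]]
sds = [statistics.stdev(g) for g in data]
print("Equal variances plausible?", equal_variance_rule_of_thumb(sds))

# Pool the residuals from all groups; these pooled values are what we would
# plot (histogram, normal-probability plot) to assess normality.
pooled = [r for group in residuals(data) for r in group]
print("Pooled residuals:", pooled)
```

Pooling the residuals gives I × J values to look at instead of only J per group, which is why the normality check is done on residuals rather than group by group.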
Total Sum-of-Squares and General ANOVA Table
See your class notes. (All this good stuff
will be worked out—on the white board—in class.)
One-Factor ANOVA Example
Many
studies have suggested that there is a link between exercise and healthy bones.
Exercise stresses the bones and this causes them to get stronger. One study
(done at Purdue University) examined the effect of jumping on the bone density
of growing rats. There were three treatments: a control with no jumping, a
low-jump condition (the jump height was 30 cm), and a high-jump condition (60
centimeters). After 8 weeks of 10 jumps per day, 5 days per week, the bone
density of the rats (expressed in mg/cm³) was measured. (Isn’t it
fun to think of rats jumping?) Comparative boxplots
and separate numerical summaries for the treatment groups are shown below.

Variable: Bone Density (mg/cm³)

Treatment       N    Mean  StDev  Minimum      Q1  Median      Q3  Maximum
1 - Control    10  601.10  27.36   554.00  587.00  601.50  615.75   653.00
2 - Low Jump   10  612.50  19.33   588.00  595.50  606.00  632.75   638.00
3 - High Jump  10  638.70  16.59   622.00  625.00  637.00  650.00   674.00

Per usual, the first step in any
statistical analysis is to summarize the variables graphically and numerically
(this gives you a first look descriptively that allows you, among other things,
to see if there are any issues with the data—for example, outliers or mis-recorded observations).
We want to test $H_0: \mu_1 = \mu_2 = \mu_3$ against $H_a$:
at least two of the means are different. First
we’ll assess the equal-variances condition of the ANOVA test. From the boxplots, the variability in bone densities looks somewhat
similar for the three treatment groups (not exactly the same, but not wildly
different). Using our rule-of-thumb, the largest standard deviation (27.36
mg/cm³) is less than twice the smallest standard deviation (16.59
mg/cm³); the ratio is 27.36/16.59 ≈ 1.65. So the equal-variance condition seems plausible for these
data.
From
Minitab we get the following ANOVA table, as well as graphs (histogram and
normal probability plot) of the residuals.
One-way ANOVA: Bone Density (mg/cm³) versus Treatment

Source     DF     SS    MS     F      P
Treatment   2   7434  3717  7.98  0.002
Error      27  12580   466
Total      29  20013


Recall
the residual plots allow us to assess the normality condition (since checking
the normality separately for each treatment group can be difficult if there are
few observations). The residuals seem to roughly follow a normal distribution,
so this condition seems plausible for these data.
Note
the P-value for our test is 0.002. That is, assuming the average bone density
is the same for all rats regardless of treatment, there is only a 0.002 chance
of getting our particular sample results or more extreme results. This gives
strong evidence that there is a difference in average bone density for at least
two of the treatment groups.
This raises the question: between which
treatment groups is there a significant difference in average bone density?
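As a check on the Minitab ANOVA table, the sums of squares, F statistic, and p-value can be roughly reproduced from the summary statistics alone (the group means and standard deviations reported earlier). This is a sketch, not Minitab's computation: the variable names are made up, the results differ slightly from Minitab's because the summary statistics are rounded, and the closed-form p-value used here holds only when the numerator degrees of freedom equal 2 (i.e., exactly three treatments).

```python
# Summary statistics from the rat-jumping example (control, low jump, high jump).
means = [601.10, 612.50, 638.70]
sds = [27.36, 19.33, 16.59]
J = 10                      # observations per treatment (balanced design)
I = len(means)              # number of treatments

# With equal sample sizes, the grand mean is the average of the group means.
grand_mean = sum(means) / I

ss_treatment = J * sum((m - grand_mean) ** 2 for m in means)
ss_error = (J - 1) * sum(s ** 2 for s in sds)

df_treatment = I - 1
df_error = I * (J - 1)

ms_treatment = ss_treatment / df_treatment
ms_error = ss_error / df_error
f_stat = ms_treatment / ms_error

# Special case: for an F distribution with 2 numerator df,
# P(F >= f) = (1 + 2f/df_error) ** (-df_error / 2).
p_value = (1 + 2 * f_stat / df_error) ** (-df_error / 2)

print(f"SSTr = {ss_treatment:.0f}, SSE = {ss_error:.0f}")
print(f"F = {f_stat:.2f}, p-value = {p_value:.3f}")
```

The results should agree with the Minitab table up to rounding (SSTr about 7434, SSE about 12580, F about 7.98, p-value about 0.002).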
Multiple Comparisons
If the ANOVA null hypothesis of equal
population means cannot be rejected, then the analysis stops. If there is evidence
of some difference in population means, then the natural next step is to find
where exactly the differences are. There is, though, an important issue to consider when doing multiple tests on the same
set of data.
Each individual test has a Type I error rate, $\alpha$. But what is the
overall or “family” error rate? For example, if two independent tests are
performed at the 0.05 level, then P(no error in either test) = (.95)(.95) =
0.9025, so the overall error rate is actually 1 - 0.9025 = 0.0975, not 0.05.
The family Type I error rate is the
chance of at least one false alarm somewhere among the set of tests, when all
underlying population averages are equal. When testing on the same data, the tests
are not independent, so we can’t use the same simplistic argument as above. But we
can apply Boole’s/Bonferroni’s inequality (space
below for you to write in our work):
The Bonferroni method for multiple comparisons tests
each pair of means using an $\alpha/m$
error rate, where m is the total number of tests and
$\alpha$ is the desired overall/family error rate. This
method is easy to apply, but is conservative
(i.e., the overall error rate might
actually be far smaller than we think it is). Note: Our textbook doesn’t
mention this method.
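The arithmetic behind the family error rate and the Bonferroni adjustment can be sketched as follows. The function names are made up for illustration, and the independent-tests formula is only the simplified argument from the two-test example (tests on the same data are not independent, which is why Bonferroni's bound is used instead).

```python
from math import comb


def family_error_rate_if_independent(alpha, m):
    """Chance of at least one false alarm among m independent level-alpha tests."""
    return 1 - (1 - alpha) ** m


def bonferroni_per_test_alpha(family_alpha, m):
    """Per-test error rate that keeps the family error rate at or below family_alpha."""
    return family_alpha / m


# The two-test example from the notes: about 0.0975 rather than 0.05.
print(round(family_error_rate_if_independent(0.05, 2), 4))

# With I = 3 treatments there are m = 3 pairwise comparisons.
m = comb(3, 2)
per_test = bonferroni_per_test_alpha(0.05, m)
print(m, round(per_test, 4))
```

With three treatments, each pairwise test would be run at roughly the 0.0167 level to hold the family error rate to at most 0.05; the per-test level shrinks quickly as the number of comparisons grows, which is where the conservatism comes from.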
There
are many methods of adjusting for multiple comparisons. The most-commonly used
method (when a computer is available for the analysis) is one developed by John
Tukey. His method (shown in our textbook) utilizes a
distribution called the studentized-range
distribution. You don’t need to know all the details, but here’s the gist of Tukey’s method:
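For each pair $i < j$, form the interval
$(\bar{x}_{i\cdot} - \bar{x}_{j\cdot}) \pm Q_{\alpha,\,I,\,I(J-1)} \sqrt{\mathrm{MSE}/J}$, where
$Q_{\alpha,\,I,\,I(J-1)}$ is a multiplier (from the studentized-range distribution) that takes into account
$\alpha$ as the family error rate (Minitab will do this
for us). If an interval does not include
0, then there is a significant difference in those means. (Note: If the
family confidence level is 95%, then each of the individual confidence levels
will be higher than 95%.)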
Important Remarks about Multiple Comparisons
· In general, statisticians typically rate “not adjusting for multiple comparisons
on the same set of data” as a grave error in methodology that can give
misleading results.
· The Bonferroni method is a quick adjustment that can be
done easily “by hand,” but (especially for large numbers of comparisons) this
method is very conservative.
· Tukey’s method is typically
less conservative than Bonferroni, yet it still
adjusts for multiple comparisons. This method is used most often in practice.
Minitab makes it easy to get Tukey’s intervals (even
for unequal sample sizes).
· If you are interested more descriptively than inferentially in the data, and
simply want to see where there might be significant differences in order to set
up your next experiment, then you can ignore the multiple-comparison issue and
do each individual test at, say, the 0.05 level. But you must think of your p-values descriptively: they only give
you information on how to focus your next experiment (this is especially
helpful if you have a huge data set with a large number of comparisons). The
pitfall of many researchers is that they want results right away, so they initially
think they’ll investigate descriptively (to set up the next experiment), but
then they get caught up in what they find and want to publish it right away.
Please don’t be that impatient researcher!
Back to the Rat-Jumping Example
Since we found a significant difference
between at least two means, it’s appropriate to now compare the means pairwise. Included on the next page is the Minitab output
for Tukey’s method (which adjusts for multiple
comparisons, and has a family error rate of 5%) and for Fisher’s method (which
doesn’t adjust for multiple comparisons). Not adjusting for multiple
comparisons (Fisher’s method) is appropriate only if you’re simply exploring the data looking for interesting
effects to investigate in another experiment. Notice that, in this case,
regardless of whether we adjust for multiple comparisons we see a significant difference in average bone density between the
control group and the high-jump group and between the low-jump group and
high-jump group (but not between the control group and the low-jump group).
Tukey 95% Simultaneous Confidence Intervals
All Pairwise Comparisons among Levels of Treatment

Individual confidence level = 98.04%

Treatment = 1-Control subtracted from:

Treatment     Lower  Center  Upper
2-Low Jump   -12.56   11.40  35.36               (-------*-------)
3-High Jump   13.64   37.60  61.56                        (-------*-------)
                                    -------+---------+---------+---------+--
                                         -30         0        30        60

Treatment = 2-Low Jump subtracted from:

Treatment     Lower  Center  Upper
3-High Jump    2.24   26.20  50.16                    (-------*-------)
                                    -------+---------+---------+---------+--
                                         -30         0        30        60

Fisher 95% Individual Confidence Intervals
All Pairwise Comparisons among Levels of Treatment

Simultaneous confidence level = 88.07%

Treatment = 1-Control subtracted from:

Treatment     Lower  Center  Upper
2-Low Jump    -8.41   11.40  31.21              (------*-----)
3-High Jump   17.79   37.60  57.41                       (------*-----)
                                    -----+---------+---------+---------+----
                                       -30         0        30        60

Treatment = 2-Low Jump subtracted from:

Treatment     Lower  Center  Upper
3-High Jump    6.39   26.20  46.01                   (------*-----)
                                    -----+---------+---------+---------+----
                                       -30         0        30        60
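To see where the Tukey intervals come from, here is a small Python sketch that rebuilds them from the treatment means and the MSE (466) in the ANOVA table, using the form (difference in sample means) ± Q × sqrt(MSE/J). Two assumptions to flag: the function name `tukey_intervals` is made up, and the multiplier Q ≈ 3.51 is not computed here but backed out from the half-width of the Minitab intervals (35.36 - 11.40 = 23.96). In practice the multiplier comes from the studentized-range distribution, which the Python standard library does not provide.

```python
import math
from itertools import combinations


def tukey_intervals(means, mse, j, q):
    """Return {(a, b): (lower, center, upper)} for mean_b - mean_a, balanced design."""
    half_width = q * math.sqrt(mse / j)
    intervals = {}
    for (name_a, mean_a), (name_b, mean_b) in combinations(means.items(), 2):
        center = mean_b - mean_a  # "a subtracted from b", as in the Minitab output
        intervals[(name_a, name_b)] = (center - half_width, center,
                                       center + half_width)
    return intervals


rat_means = {"Control": 601.10, "Low Jump": 612.50, "High Jump": 638.70}
Q_APPROX = 3.51  # assumption: approximate multiplier implied by the Minitab output

results = tukey_intervals(rat_means, mse=466, j=10, q=Q_APPROX)
for pair, (lower, center, upper) in results.items():
    significant = lower > 0 or upper < 0  # interval excludes 0
    print(pair, round(lower, 2), round(center, 2), round(upper, 2), significant)
```

The intervals should match the Tukey output above to within rounding, and only the Control vs. Low Jump interval contains 0, in line with the conclusion stated before the Minitab output.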