MATH 217: Bootstrap Methods and Permutation Tests
Bootstrap Method
Suppose high school
seniors take a standardized math exam (maximum score of 100—no units, because
it is standardized). We’d like to know the average score for the population of
high school seniors. We have a random sample of only 6 scores:
| Observation Number | 1    | 2    | 3    | 4    | 5    | 6    |
| Exam Score         | 69.3 | 90.6 | 79.3 | 71.0 | 82.7 | 93.6 |
In this situation, if we can assume the population of all standardized exam
scores follows a normal (or close to normal) distribution, we have theory to
guide us: the Central Limit Theorem (in practice, not in the limit) tells us the
sampling distribution of the sample average will be approximately normal. With a
sample of only 6 scores, this conclusion depends on the population itself being
close to normal.
But what about a
different sample statistic, say the sample range? The
Central Limit Theorem only tells us about averages and totals, not about other
possible statistics. What if we have an applied statistics problem where we are
interested in the sampling distribution of the sample range? (That is, if all
samples of size 6 were taken from this population what would be the
distribution—possible values and number of occurrences—of the sample range?)
Bootstrap methods are now frequently used in statistical analysis (mainly
because computer technology has advanced to meet the needs of
computer-intensive statistical methods). Here’s the basic idea:
1) We treat our sample of data as our whole population (note this assumes the
sample is a good representation of the population).
2) We resample (with replacement) from our original sample; these sets of
re-sampled data are called bootstrap samples. Each bootstrap sample is the same
size as the original sample.
3) For each of the bootstrap samples, we calculate the value of the appropriate
sample statistic (e.g., mean, range).
4) Based on these bootstrapped values of the sample statistic, we graphically
(e.g., via a histogram) get an estimate of the sampling distribution of that
particular statistic, and, by calculating the standard deviation of the
bootstrapped statistic values, we get an estimate of the standard error of our
sample statistic.
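The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's lab code: the function names `bootstrap` and `sample_range`, the choice of 1000 bootstrap samples, and the seed are assumptions made for the example.

```python
import random
import statistics

# The six exam scores from the sample above.
scores = [69.3, 90.6, 79.3, 71.0, 82.7, 93.6]


def sample_range(values):
    """Sample range: largest value minus smallest value."""
    return max(values) - min(values)


def bootstrap(data, statistic, n_boot=1000, seed=217):
    """Return n_boot bootstrapped values of `statistic` (hypothetical helper).

    Each bootstrap sample has the same size as `data` and is drawn from
    `data` with replacement. The seed is arbitrary; it is fixed only so
    that the run is reproducible.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=len(data))  # step 2: with replacement
        values.append(statistic(resample))         # step 3: the statistic
    return values


# Both calls use the same seed, so they see the same bootstrap samples.
boot_means = bootstrap(scores, statistics.mean)
boot_ranges = bootstrap(scores, sample_range)

# Step 4: the standard deviation of the bootstrapped values estimates
# the standard error of the statistic.
se_mean = statistics.stdev(boot_means)
se_range = statistics.stdev(boot_ranges)

print(f"bootstrap SE of the sample mean:  {se_mean:.2f}")
print(f"bootstrap SE of the sample range: {se_range:.2f}")
```

A histogram of `boot_means` or `boot_ranges` would then serve as the graphical estimate of the sampling distribution described in step 4.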
In other words, we don’t know the distribution of the population, so we “pull
ourselves up by our bootstraps” and use our sample of data to represent the
population. Then we simulate repeated sampling from the population by
repeatedly sampling (with replacement) from the sample. This is now very easy
to do via computer, but we’ll start with a simulation by hand, using die rolls
to do the bootstrap sampling, so that you better understand the idea.
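The hand simulation can be mimicked on a computer as follows. The sketch assumes that each roll of a six-sided die selects the observation with that number (a roll of 3 picks observation 3, and so on); the handout does not spell out this mapping, and the helper name `roll_bootstrap_sample` is made up for the example.

```python
import random

# Observation number -> exam score, from the sample above.
scores = {1: 69.3, 2: 90.6, 3: 79.3, 4: 71.0, 5: 82.7, 6: 93.6}


def roll_bootstrap_sample(rng):
    """Roll a die six times; each roll picks one observation (assumed mapping)."""
    rolls = [rng.randint(1, 6) for _ in range(6)]
    sample = [scores[r] for r in rolls]
    return rolls, sample


rng = random.Random(1)  # arbitrary seed, for reproducibility only
rolls, sample = roll_bootstrap_sample(rng)

boot_mean = sum(sample) / len(sample)
boot_range = max(sample) - min(sample)

print("die rolls:        ", rolls)
print("bootstrap sample: ", sample)
print(f"bootstrapped mean:  {boot_mean:.2f}")
print(f"bootstrapped range: {boot_range:.2f}")
```

Because the same face can come up more than once, the same score can appear several times in one bootstrap sample, which is exactly what sampling with replacement means.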
Student Data Table (Bootstrap Simulation)
|                    |   |   |   |   |   |   | Bootstrapped Sample Mean |   |
| Bootstrap Sample 1 |   |   |   |   |   |   |                          |   |
| Bootstrap Sample 2 |   |   |   |   |   |   |                          |   |
| Bootstrap Sample 3 |   |   |   |   |   |   |                          |   |
| Bootstrap Sample 4 |   |   |   |   |   |   |                          |   |
| Bootstrap Sample 5 |   |   |   |   |   |   |                          |   |
| Bootstrap Sample 6 |   |   |   |   |   |   |                          |   |
Typically, for a computer simulation, we’d (easily) create 1000 bootstrap
samples. In our case, we have 24 bootstrap samples. We’ll graph our bootstrap
means and bootstrap ranges and further discuss this process (and the
limitations of the bootstrap). In the computer lab, we’ll discuss confidence
intervals based on the bootstrap method.
Permutation Tests
Suppose a nutritionist
created a new diet, which he thinks will help people lose weight. He has six
overweight, female volunteers for the study; he randomly assigns 3 people to
the new diet and 3 people to the control group (no change in diet). Note: We’re
keeping the numbers small so you can actually simulate this test by hand. (In
general, it’s not a good idea to use bootstrap or permutation methods on very
small sets of data.) After 6 months, the weight losses (in pounds) are shown in
the table below.
| Treatment Group (New Diet) | Control Group (Usual Diet) |
| 15                         | 4                          |
| 9                          | 10                         |
| 10                         | 5                          |
The sample average for
the treatment group is 11.33 pounds and the sample average in the control group
is 6.33 pounds. Hence, the difference in the averages (for this particular sample) is 5 pounds.
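As a quick check of this arithmetic, the sketch below recomputes the two sample averages and their difference. It assumes the treatment group's weight losses are 15, 9, and 10 pounds and the control group's are 4, 10, and 5 pounds, which is the grouping that reproduces the averages quoted above.

```python
# Weight losses in pounds (grouping assumed from the quoted averages).
treatment = [15, 9, 10]
control = [4, 10, 5]

mean_treatment = sum(treatment) / len(treatment)  # 34 / 3
mean_control = sum(control) / len(control)        # 19 / 3
observed_diff = mean_treatment - mean_control     # 15 / 3

print(f"treatment mean: {mean_treatment:.2f}")
print(f"control mean:   {mean_control:.2f}")
print(f"difference:     {observed_diff:.2f}")
```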
We could compare these
groups via a two-sample t test, but recall that test requires that the two
populations follow normal distributions (which can sometimes be difficult to
verify and other times simply isn’t true). What we really want to test is the null hypothesis that the special diet has no
effect on the distribution of weight losses. We can use re-sampling to
perform this test (in a non-parametric way). We must resample (this time, without replacement) in a way that is
consistent with the null hypothesis and the study design.
Under the null
hypothesis, we can consider all six values to come from the same distribution
(treatment and control are no different). Hence, we can sample (without
replacement) three of the six values to be our “treatment group”. Then we can
calculate the difference in means (treatment mean – control mean) for that sample.
We repeat this process to obtain many different re-sampled differences in
means (assuming the null hypothesis is true). Once we have this distribution of
differences in the means, we can compare our particular value, 5 pounds, to all
the possible values. If a difference as large as 5 pounds is highly unlikely
under the null hypothesis, then we have reason to reject the hypothesis that
the special diet has no effect on the distribution of weight losses.
We’ll do this by hand in this exercise. Note that a computer can easily and
quickly generate many, many re-sampled differences in sample means. Then the
p-value of the test is simply the proportion of re-sampled differences that are
at least as extreme as our particular difference in sample means.
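A computer version of this test might look like the sketch below. Several things here are assumptions rather than part of the handout: the function names, the 10,000 resamples, the seed, and the decision to count only differences in the direction of greater weight loss on the new diet (a one-sided test). Exact fractions are used so that ties with the observed difference are counted reliably.

```python
import random
from fractions import Fraction

# Weight losses in pounds (grouping assumed from the quoted averages).
treatment = [15, 9, 10]
control = [4, 10, 5]


def diff_in_means(group_a, group_b):
    """Mean of group_a minus mean of group_b, as an exact fraction."""
    return Fraction(sum(group_a), len(group_a)) - Fraction(sum(group_b), len(group_b))


def permutation_p_value(treatment, control, n_resamples=10_000, seed=217):
    """One-sided Monte Carlo permutation p-value (hypothetical helper)."""
    rng = random.Random(seed)  # arbitrary seed, for reproducibility only
    pooled = treatment + control
    n_treat = len(treatment)
    observed = diff_in_means(treatment, control)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # Random reordering of all six values: sampling without replacement.
        shuffled = rng.sample(pooled, k=len(pooled))
        resampled = diff_in_means(shuffled[:n_treat], shuffled[n_treat:])
        if resampled >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_resamples


observed = diff_in_means(treatment, control)
p_value = permutation_p_value(treatment, control)

print(f"observed difference: {float(observed):.2f} pounds")
print(f"approximate one-sided p-value: {p_value:.3f}")
```

A two-sided version would instead compare absolute values, counting resampled differences whose magnitude is at least that of the observed difference.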
|            | Treatment Group (New Diet) |   |   | Control Group (Usual Diet) |   |   | Treatment Mean | Control Mean | Difference in Means |
| Resample 1 |                            |   |   |                            |   |   |                |              |                     |
| Resample 2 |                            |   |   |                            |   |   |                |              |                     |
| Resample 3 |                            |   |   |                            |   |   |                |              |                     |
| Resample 4 |                            |   |   |                            |   |   |                |              |                     |
| Resample 5 |                            |   |   |                            |   |   |                |              |                     |
We’ll graph our 20 re-sampled differences in means and then see where the
observed value, 5 pounds, falls. (Note there are actually only 20 possible
combinations of treatment and control groups, since there are 20 ways to choose
3 of the 6 values for the treatment group. We haven’t taken care to make sure
we have all 20 samples, and we probably have a few repeats, but this is close
to the full sampling distribution.)
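Because there are only 20 possible splits, a computer can list them all instead of sampling at random. The sketch below does this under the same assumptions as before (treatment values 15, 9, 10; control values 4, 10, 5; a one-sided comparison); the variable and function names are illustrative. It chooses positions rather than values, because the value 10 appears twice in the data.

```python
from fractions import Fraction
from itertools import combinations

# All six weight losses; positions 0-2 are the observed treatment group.
weight_losses = [15, 9, 10, 4, 10, 5]
observed_treatment = (0, 1, 2)


def diff_for(treatment_positions):
    """Treatment mean minus control mean for one split, as an exact fraction."""
    treat = [weight_losses[i] for i in treatment_positions]
    ctrl = [w for i, w in enumerate(weight_losses) if i not in treatment_positions]
    return Fraction(sum(treat), len(treat)) - Fraction(sum(ctrl), len(ctrl))


# Every way to choose 3 of the 6 positions for the treatment group.
all_splits = list(combinations(range(len(weight_losses)), 3))
all_diffs = [diff_for(split) for split in all_splits]

observed = diff_for(observed_treatment)
n_extreme = sum(d >= observed for d in all_diffs)
exact_p = Fraction(n_extreme, len(all_diffs))

print(f"number of possible splits: {len(all_splits)}")
print(f"splits with difference >= {observed}: {n_extreme}")
print(f"exact one-sided p-value: {exact_p} = {float(exact_p):.2f}")
```

With these six values, 3 of the 20 splits give a difference of at least 5 pounds, so the exact one-sided p-value is 3/20 = 0.15.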