MATH 217: Bootstrap Methods and Permutation Tests

 

Bootstrap Method

Suppose high school seniors take a standardized math exam (maximum score of 100—no units, because it is standardized). We’d like to know the average score for the population of high school seniors. We have a random sample of only 6 scores:

Observation Number   1      2      3      4      5      6
Exam Score           69.3   90.6   79.3   71.0   82.7   93.6

In this situation, if we can assume the population of all standardized exam scores follows a normal (or close to normal) distribution, we have theory to guide us: the sampling distribution of the sample average will be (approximately) normal. (With a sample of only 6, the Central Limit Theorem by itself does not guarantee this; it is the near-normality of the original population that justifies it.)

 

But what about a different sample statistic, such as the sample range? The Central Limit Theorem only tells us about averages and totals, not about other statistics. What if we have an applied statistics problem where we are interested in the sampling distribution of the sample range? (That is, if all possible samples of size 6 were taken from this population, what would be the distribution of the sample range: its possible values and how often each occurs?)
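To make the two statistics concrete, here is a minimal Python sketch (the variable names are our own) that computes the observed sample mean and sample range of the six scores. The question above is about how the second of these would vary from sample to sample.

```python
import statistics

# The six exam scores from the table above
scores = [69.3, 90.6, 79.3, 71.0, 82.7, 93.6]

sample_mean = statistics.mean(scores)      # about 81.08
sample_range = max(scores) - min(scores)   # 93.6 - 69.3, about 24.3

print(f"sample mean  = {sample_mean:.2f}")
print(f"sample range = {sample_range:.1f}")
```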

 

Bootstrap methods are now frequently used in statistical analysis, largely because computers have become fast enough to handle computer-intensive statistical methods. Here's the basic idea:

1) We treat our sample of data as if it were the whole population (this assumes the sample is a good representation of the population).
2) We resample, with replacement, from the original sample; each resampled data set, the same size as the original, is called a bootstrap sample.
3) For each bootstrap sample, we calculate the value of the sample statistic of interest (e.g., mean, range).
4) From these bootstrapped values of the statistic, we get a graphical estimate (e.g., a histogram) of the sampling distribution of that statistic, and the standard deviation of the bootstrapped values gives an estimate of the standard error of our sample statistic.
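The four steps above can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the function names `bootstrap` and `sample_range`, the choice of 1000 replicates, and the seeds are our own assumptions. It uses only the standard library, where `random.Random.choices` draws with replacement.

```python
import random
import statistics

def bootstrap(data, statistic, n_boot=1000, seed=None):
    """Return n_boot bootstrap replicates of `statistic` computed on `data`.

    Each replicate resamples len(data) values with replacement from the
    original sample, treating the sample as if it were the population (step 1).
    """
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        boot_sample = rng.choices(data, k=n)        # step 2: resample with replacement
        replicates.append(statistic(boot_sample))   # step 3: compute the statistic
    return replicates

def sample_range(xs):
    return max(xs) - min(xs)

scores = [69.3, 90.6, 79.3, 71.0, 82.7, 93.6]

boot_means = bootstrap(scores, statistics.mean, seed=1)
boot_ranges = bootstrap(scores, sample_range, seed=2)

# Step 4: the standard deviation of the replicates estimates the standard error
print("bootstrap SE of the mean: ", round(statistics.stdev(boot_means), 2))
print("bootstrap SE of the range:", round(statistics.stdev(boot_ranges), 2))
```

A histogram of `boot_means` or `boot_ranges` is the graphical estimate described in step 4.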

 

Hence, even though we don't know the distribution of the population, we "pull ourselves up by our bootstraps" and use our sample of data to represent the population. We then simulate repeated sampling from the population by repeatedly sampling, with replacement, from the sample. This is very easy to do by computer, but we'll start with a simulation by hand, using die rolls to do the bootstrap sampling (to help you better understand the idea): each roll of a six-sided die selects one of the six observation numbers.
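In the by-hand version, a die roll plays the role of the random draw. The Python sketch below mimics that procedure, assuming a fair six-sided die whose face k selects observation number k; the helper name `die_roll_bootstrap_sample` and the seed are hypothetical. It fills in six rows like the worksheet table.

```python
import random

scores = [69.3, 90.6, 79.3, 71.0, 82.7, 93.6]  # observation numbers 1..6

def die_roll_bootstrap_sample(rng):
    """Roll a fair six-sided die six times; face k selects observation k."""
    faces = [rng.randint(1, 6) for _ in range(6)]   # randint includes both endpoints
    return [scores[face - 1] for face in faces]

dice = random.Random(217)  # arbitrary seed so the rows are reproducible
for row in range(1, 7):    # six rows, like the worksheet table
    sample = die_roll_bootstrap_sample(dice)
    boot_mean = sum(sample) / len(sample)
    boot_range = max(sample) - min(sample)
    print(f"Bootstrap Sample {row}: {sample}  mean = {boot_mean:.2f}  range = {boot_range:.1f}")
```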

 

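Student Data Table (Bootstrap Simulation)

                     Draw 1   Draw 2   Draw 3   Draw 4   Draw 5   Draw 6   Bootstrapped Sample Mean   Bootstrapped Sample Range
Bootstrap Sample 1   ______   ______   ______   ______   ______   ______   ______                     ______
Bootstrap Sample 2   ______   ______   ______   ______   ______   ______   ______                     ______
Bootstrap Sample 3   ______   ______   ______   ______   ______   ______   ______                     ______
Bootstrap Sample 4   ______   ______   ______   ______   ______   ______   ______                     ______
Bootstrap Sample 5   ______   ______   ______   ______   ______   ______   ______                     ______
Bootstrap Sample 6   ______   ______   ______   ______   ______   ______   ______                     ______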

Typically, in a computer simulation we'd easily create 1000 or more bootstrap samples. In our case, we have only 24 bootstrap samples. We'll graph our bootstrap means and bootstrap ranges and discuss this process further (including the limitations of the bootstrap). In the computer lab, we'll discuss confidence intervals based on the bootstrap method.
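A computer version with 1000 bootstrap samples might look like the Python sketch below. The seed and the 5-point bin width of the text histogram are arbitrary choices. It also illustrates one limitation of the bootstrap for the range: because every bootstrap sample reuses the original six scores, a bootstrapped range can never exceed the observed range of 24.3 points, even though samples from the actual population could have larger ranges.

```python
import random
import statistics
from collections import Counter

scores = [69.3, 90.6, 79.3, 71.0, 82.7, 93.6]
observed_range = max(scores) - min(scores)

rng = random.Random(2024)  # arbitrary seed for reproducibility
boot_ranges = []
for _ in range(1000):
    b = rng.choices(scores, k=len(scores))   # one bootstrap sample, with replacement
    boot_ranges.append(max(b) - min(b))

# Crude text histogram: bins of width 5 points, one '*' per 10 bootstrap ranges
bins = Counter(int(r // 5) * 5 for r in boot_ranges)
for left in sorted(bins):
    print(f"[{left:2d}, {left + 5:2d})  {'*' * (bins[left] // 10)}")

print("bootstrap SE of the range:", round(statistics.stdev(boot_ranges), 2))
# Bootstrap samples only reuse the original values, so no range can exceed the observed one
print("largest bootstrap range:", round(max(boot_ranges), 1), "vs observed", round(observed_range, 1))
```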

 


Permutation Tests

Suppose a nutritionist has created a new diet that he thinks will help people lose weight. He has six overweight female volunteers for the study; he randomly assigns 3 of them to the new diet and the other 3 to the control group (no change in diet). Note: We're keeping the numbers small so you can actually carry out this test by hand. (In general, it's not a good idea to use bootstrap or permutation methods on very small data sets.) The weight losses (in pounds) after 6 months are shown in the table below.

 

Treatment Group (New Diet)   15    9   10
Control Group (Usual Diet)    4   10    5

 

The sample average for the treatment group is 11.33 pounds, and the sample average for the control group is 6.33 pounds. Hence, the difference in the averages (treatment minus control, for this particular sample) is 5 pounds.
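Written out, using x̄_T and x̄_C as our labels for the treatment and control sample means:

```latex
\bar{x}_T = \frac{15 + 9 + 10}{3} = \frac{34}{3} \approx 11.33, \qquad
\bar{x}_C = \frac{4 + 10 + 5}{3} = \frac{19}{3} \approx 6.33, \qquad
\bar{x}_T - \bar{x}_C = \frac{34 - 19}{3} = 5 .
```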

 

We could compare these groups with a two-sample t test, but recall that this test assumes both populations follow normal distributions (which can be difficult to verify, and sometimes simply isn't true). What we really want to test is the null hypothesis that the new diet has no effect on the distribution of weight losses. We can use resampling to perform this test without assuming any particular distribution (a nonparametric approach). We must resample, this time without replacement, in a way that is consistent with both the null hypothesis and the study design.

 

Under the null hypothesis, all six values come from the same distribution (treatment and control are no different), so the group labels are arbitrary. Hence, we can randomly choose (without replacement) three of the six values to be our "treatment group"; the remaining three form the "control group". Then we calculate the difference in means (treatment mean minus control mean) for that resample. Repeating this process gives many resampled differences in means, all generated as if the null hypothesis were true. Once we have this distribution of differences, we compare our observed value, 5 pounds, to it. If a difference as large as 5 pounds is very unlikely under the null hypothesis, we have reason to reject the hypothesis that the new diet has no effect on the distribution of weight losses.
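One resample under the null hypothesis can be sketched in Python by shuffling the six pooled values and splitting them 3 and 3. The function name `resample_difference` and the seed are illustrative assumptions; `random.Random.sample` draws without replacement, so taking all six values gives a random permutation.

```python
import random

treatment = [15, 9, 10]
control = [4, 10, 5]
pooled = treatment + control  # under H0, all six values come from one distribution

def resample_difference(rng):
    """Randomly relabel the six values: 3 go to 'treatment', the other 3 to 'control'."""
    shuffled = rng.sample(pooled, k=len(pooled))  # random permutation, without replacement
    new_treatment, new_control = shuffled[:3], shuffled[3:]
    return sum(new_treatment) / 3 - sum(new_control) / 3

rng = random.Random(6)  # arbitrary seed
for i in range(1, 6):   # five rows, like the worksheet table
    print(f"Resample {i}: difference in means = {resample_difference(rng):.2f}")
```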

 

We'll do this by hand in this exercise. A computer, however, can quickly generate many thousands of resampled differences in sample means. The p-value of the test is then the proportion of resampled differences that are at least as extreme as our observed difference in sample means.
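A computer-based version of the whole test might look like the minimal sketch below. It assumes a one-sided alternative (the new diet increases weight loss, as the nutritionist believes), so "at least as extreme" means a difference of 5 pounds or more; the 10,000 resamples and the seed are arbitrary choices. For a two-sided test you would compare absolute differences instead.

```python
import random

treatment = [15, 9, 10]
control = [4, 10, 5]
pooled = treatment + control
observed = sum(treatment) / 3 - sum(control) / 3   # 5 pounds

def resampled_differences(n_resamples, seed=None):
    """Differences in means (treatment - control) after random relabeling under H0."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        shuffled = rng.sample(pooled, k=len(pooled))
        diffs.append(sum(shuffled[:3]) / 3 - sum(shuffled[3:]) / 3)
    return diffs

diffs = resampled_differences(10_000, seed=217)

# One-sided p-value: proportion of resampled differences at least as large as
# the observed one. The small tolerance guards against floating-point round-off.
p_value = sum(d >= observed - 1e-9 for d in diffs) / len(diffs)
print(f"observed difference = {observed:.2f}, estimated p-value = {p_value:.3f}")
```

With this many resamples, the estimate should land close to 3/20 = 0.15, the value obtained by listing every possible split of the six volunteers.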

 

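             Treatment Group (New Diet)   Control Group (Usual Diet)   Treatment Mean   Control Mean   Difference in Means
Resample 1   ____  ____  ____             ____  ____  ____             ______           ______         ______
Resample 2   ____  ____  ____             ____  ____  ____             ______           ______         ______
Resample 3   ____  ____  ____             ____  ____  ____             ______           ______         ______
Resample 4   ____  ____  ____             ____  ____  ____             ______           ______         ______
Resample 5   ____  ____  ____             ____  ____  ____             ______           ______         ______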

We'll graph our 20 resampled differences in means and then see where the observed value, 5 pounds, falls. (Note that there are only 20 possible ways to split the six volunteers into a treatment group and a control group of 3 each, since C(6, 3) = 20. We haven't made sure our 20 resamples cover all 20 splits (we probably have a few repeats), but the result should be close to the full permutation distribution.)
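Because there are only 20 possible splits, a computer can also list all of them and compute the exact permutation p-value. Here is a minimal Python sketch using `itertools.combinations`. It treats the two 10-pound losses as belonging to different volunteers and, again, assumes the one-sided alternative. `Fraction` keeps the differences exact, so no rounding tolerance is needed.

```python
from fractions import Fraction
from itertools import combinations

weights = [15, 9, 10, 4, 10, 5]   # first three: treatment, last three: control
observed = Fraction(sum(weights[:3]) - sum(weights[3:]), 3)   # exactly 5

diffs = []
for idx in combinations(range(6), 3):   # all C(6, 3) = 20 ways to pick the treatment group
    treat = [weights[i] for i in idx]
    ctrl = [weights[i] for i in range(6) if i not in idx]
    diffs.append(Fraction(sum(treat) - sum(ctrl), 3))   # treatment mean - control mean

n_extreme = sum(d >= observed for d in diffs)
print(len(diffs), "possible assignments")
print("one-sided exact p-value:", n_extreme, "/", len(diffs), "=", n_extreme / len(diffs))
```

Running it lists 20 assignments. Three of them give a difference of at least 5 pounds: two splits with a difference of exactly 5 (the two 10-pound values are interchangeable) and one with 17/3 ≈ 5.67. The exact one-sided p-value is therefore 3/20 = 0.15. The two-sided p-value, which also counts differences of -5 or less, is 6/20 = 0.30.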