After grading
the first exam, I want to re-iterate two important pieces of the bootstrap idea:
·
Bootstrap
methods are needed when we do not have
theory to guide us. For example, when we want to know the distribution of
the sample standard deviation or the sample first quartile (there is no Central
Limit Theorem for such statistics—the CLT applies only to averages and totals).
Many statistics do not have normal
distributions, even when based on large samples. The bootstrap methods
allow you to check the sampling distribution of any particular statistic, and
if it is normal, then you can use a t-like confidence interval based on the
bootstrap information. But more often you must use a percentile confidence
interval (completely based on the bootstrap values, not based on any theory).
·
Bootstrap
methods are applied when we have a single random sample from a population, but
no other knowledge about the population. We want to use a statistic from the
sample to estimate a parameter of the population. Hence, we want to know the sampling distribution of that particular statistic
for our particular sample size. So when we resample (with replacement), we
take bootstrap samples that are the same size
as our original sample. This then allows us access to the approximate sampling
distribution of the statistic from our original sample.
The textbook (Section
10.6) discusses both bootstrap confidence intervals and permutation tests for
two-sample and paired data. Although we will briefly discuss permutation tests,
we’ll focus on bootstrap confidence intervals. (Permutation tests are much more
coding and computer-time intensive. They are good for you to know about, but
not necessary for you to implement in this class.)
Paired Data
Data that come
from a matched-pairs design are really just one-sample data that are paired.
Hence, the bootstrap idea is the same as in the one-sample case. Reconsider the
class example about the new diet. A researcher is interested in a new diet,
which he thinks, on average, will help people lose weight. For a sample of 28
women he records their weights both before and after a diet period of 3 months.
The weight losses (before weight – after weight) for the sample are as follows:
Weight Losses (in pounds)
7
5 -5 8
13 9 17
-2 19 5
13 -8 6
9 2 1
5 11 15
19 -3 -12
-2 6 6 7
3 -3
Enter these
data into R:
>
WeightLoss <- c(7,5,-5,8,13,9,17,-2,19,5,13,-8,6,9,2,1,5,11,15,19,-3,-12,-2,6,6,7,3,-3)
Create a histogram
of these data:
>
hist(WeightLoss,
main="Weight Losses for 28 Women on a New Diet", xlab="Weight
Loss (in pounds)")
As discussed in class, the distribution of weight losses for this
sample looks roughly mound-shaped. We can investigate this further by creating
a normal-quantile plot. To do this, we must first
explore two new R functions: seq and qnorm.
Look at the help entry for each of these functions:
>
help(seq)
>
help(qnorm)
To create a normal-quantile plot for
these data, first create a corresponding vector of quantiles/percentiles
from the standard normal distribution:
> NormalQuantiles <-
qnorm(seq(1/29, 28/29,
1/29))
Then sort the weight loss data to create a vector of sample quantiles/percentiles:
> WeightLossQuantiles
<- sort(WeightLoss)
Finally, create a scatterplot of the
two sets of quantiles:
> plot(NormalQuantiles,
WeightLossQuantiles, main = "
Note that the normal-quantile plot
shows a fairly straight line, indicating that the condition that we’ve sampled
from a normal population is reasonable. Hence, the t inference procedures are
valid here and, in fact, should be used. The
bootstrap method should agree with the t method. Recall that paired-data
problems reduce to a single sample of differences. So we can use the bootstrap
confidence interval method we’ve previously discussed (for one sample).
We need to create 1000 bootstrap samples of size 28 (sampling
with replacement from the “population” of 28 weight losses). For each of these
samples, we determine the sample average weight loss. Then we have 1000 sample
average weight losses, the distribution of which gives us an estimate of the
sampling distribution of the mean weight loss. Furthermore, we can use these
1000 values to determine a percentile bootstrap confidence interval.
First name your
function:
> BootstrapMeanOneSample
<- function(){}
Then edit your
function through the editor (if you type in your function directly in the
prompt menu, you are not allowed to correct mistakes on previous lines, and you
can easily lose work):
> BootstrapMeanOneSample
<- edit(BootstrapMeanOneSample)
Now type in the
following function, exactly as shown below. (You need not type in my comments,
which follow the pound signs; I included them so you’d know what the program is
doing.) This is a more general function
than I gave out previously.
function(B,x){ # B is the number of bootstrap
samples, x is vector representing the “population”
bootmeans <- NULL # initializing the vector “bootmeans”
for
(i in 1:B){ # this loop does the same
thing for each of the B bootstrap samples
bootsamp <-
sample(x, length(x), replace=TRUE) # sample with replacement
from the vector x
bootmeans[i] <- mean(bootsamp) # calculate the mean of the
bootstrap sample and store it in bootmeans
}
bootmeans # return the vector of
bootstrapped means
}
When
you are finished, simply exit out of the editor. The program will then ask if
you want to save changes; say “yes” to this.
Important note: if there is a syntax error in your program, R will let you
know; in order to get back to edit your program, you must type BootstrapMeanOneSample <- edit() at the prompt
(otherwise you may lose your work).
Now run the
function using B = 1000 (i.e., 1000 bootstrap samples—this is the
number suggested for confidence intervals) and the WeightLoss
vector. Be sure to save the results somewhere, otherwise R will simply output
the results and you won’t have the ability to manipulate them:
> BootstrapWeightLossMeans
<- BootstrapMeanOneSample(1000,WeightLoss)
Create a
histogram of the 1000 bootstrapped means. What does it look like? In this case
(because the researcher thinks the diet will help people lose weight), we
should create a one-sided bootstrap confidence bound. To get a 95% confidence
lower bound, you simply need to determine the 5th percentile of the
bootstrapped average weight losses:
>
quantile(BootstrapWeightLossMeans,.05)
5%
2.964286
For my 1000 bootstrap values, the lower confidence bound is 3.04
pounds. That is, we are 95% confident that women lose at least 2.96 pounds on
average with this diet. Recall the lower-bound we found using the t confidence
bound was 2.87 pounds. Note the t bound
and the bootstrap bound are quite close—this is because the normality condition
was actually met in this situation.
Important
Homework Notes: 1)
Because of the nature of this particular problem, we created a one-sided
bootstrap confidence interval—in the homework you will create the usual
two-sided bootstrap percentile confidence intervals; and 2) Problem 81 gives two columns of data for each textbook—you can
enter both vectors into R and then simply subtract the vectors to get the
differences (and then proceed as we did in the weight-loss example).
Two-Sample Data
Now suppose we
have independent samples from two different populations, and we’re interested
in estimating the difference in means for these populations. Let’s
consider another weight-loss diet example. Suppose a different researcher has
developed a diet designed to help people lose weight. In this case, though, the
researcher wonders if the diet affects men and women differently, on average. He
has a sample of 10 men and 10 women. After 3 months on the diet, the two
samples show the following weights losses.
Male Weight Losses (in
pounds)
-2 6
10 12 26
25 0 20
18 9
Female Weight Losses (in
pounds)
-4 -3
0 6 9 0
23 4 12
15
Enter both of these vectors of data into R:
>
MaleWeightLoss <- c(-2,6,10,12,26,25,0,20,18,9)
>
FemaleWeightLoss <- c(-4,-3,0,6,9,0,23,4,12,15)
Create histograms for both sets of weight losses. It’s difficult
to tell with such small samples, but it appears that the normality condition is
not reasonable in this case. Hence, rather
than using the t procedures, we should use bootstrapping to create a confidence
interval for the difference in average weight losses for the two populations.
To
bootstrap in this two-sample problem, we repeatedly sample with replacement
separately from each “population” of 10 weight losses.
Within each iteration, we have a bootstrap sample from the men and a bootstrap
sample from the women. We calculate the means for these two bootstrap samples,
and then take the difference in the means. This difference in means is our
statistic of interest. Eventually we’ll
have 1000 differences in average weight loss (male average weight loss minus
female average weight loss). We can use these bootstrap differences in
averages to estimate the true difference in population mean weight losses.
First name your
function:
> BootstrapMeanTwoSample
<- function(){}
Then edit your
function through the editor (if you type in your function directly in the
prompt menu, you are not allowed to correct mistakes on previous lines, and you
can easily lose work):
> BootstrapMeanTwoSample
<- edit(BootstrapMeanTwoSample)
Now type in the
following function, exactly as shown below (please
ask questions if you have them).
function(B,x,y){
bootmeandifference <- NULL
for
(i in 1:B){
bootsampx <-
sample(x, length(x), replace=TRUE)
bootsampy <-
sample(y, length(y), replace=TRUE)
bootmeandifference[i] <- mean(bootsampx) – mean(bootsampy)
}
bootmeandifference
}
When
you are finished, simply exit out of the editor. The program will then ask if
you want to save changes; say “yes” to this. Important note: if there is a syntax error in your program, R will
let you know; in order to get back to edit your program, you must type BootstrapMeanTwoSample <- edit() at the prompt
(otherwise you may lose your work).
Now run the
function using B = 1000 (i.e., 1000 bootstrap samples—this is the
number suggested for confidence intervals) and the MaleWeightLoss
and FemaleWeightLoss vectors. Be sure to save the
results somewhere, otherwise R will simply output the results and you won’t
have the ability to manipulate them:
>
BootstrapWeightLossDifferences <- BootstrapMeanTwoSample(1000,MaleWeightLoss,FemaleWeightLoss)
Create a
histogram of the 1000 bootstrapped mean differences. What does it look like?
(So maybe it would be okay to use the two-sample t procedures? Or a t-like
confidence interval? Why?)
To create a 95%
bootstrap (percentile) confidence interval for the difference in population
mean weight losses, simply find the 2.5th and 97.5th
percentiles of your bootstrap differences:
>
quantile(BootstrapWeightLossDifferences,
c(.025,.975))
2.5%
97.5%
-1.8025
13.5050
My confidence interval for the difference in average weight
losses for men and women is (-1.8 pounds, 13.5 pounds). Note that this interval
contains 0. Hence, based on the bootstrap
confidence interval, at the 0.05 level we don’t have evidence that the average
weight loss is different for men and women.