General Idea of Bootstrap Methods
Bootstrap
methods are now frequently used in statistical analysis (mainly because
computer technology has advanced to meet the needs of computer-intensive
statistical methods). Suppose you have a random sample from a population where
the parameters and distribution of the population are unknown. You are
interested in estimating a parameter, $\theta$, of the population, and you determine
an estimator, $\hat{\theta}$, based on the sample data. In order to
do inference, it’s important to know the behavior of an estimator (e.g., its
distribution and standard error). In some cases, we have theory (e.g., the
Central Limit Theorem) to tell us the behavior of an estimator; in other cases,
we don’t. What if we don’t have any statistical theory to tell us the behavior
of $\hat{\theta}$?
We can treat
our sample of data as our whole population (note this assumes the sample is a
good representation of the population). Then we can resample (with replacement) from our original
sample; these sets of re-sampled data are called bootstrap samples. For each of
the bootstrap samples, we can calculate the value of $\hat{\theta}$. Based on
these bootstrapped values of $\hat{\theta}$, we can graphically get an estimate of
the distribution of $\hat{\theta}$, and, by calculating the standard
deviation of the $\hat{\theta}$ values, we can get an estimate of the standard
error of $\hat{\theta}$.
Hence, we don’t
know the distribution of the population, but we “pull ourselves up by our
bootstraps,” and use our sample of data to represent the population. Then we simulate
repeated sampling (with replacement) from the population by repeatedly sampling
from the sample.
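The resampling idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's own code: the helper name bootstrap_estimates, the made-up data, the choice of the median as the estimator, and the use of 1000 bootstrap samples are all hypothetical choices.

```python
import random
import statistics


def bootstrap_estimates(sample, estimator, n_boot=1000, seed=None):
    """Return the estimator's value on each of n_boot bootstrap samples."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_boot):
        # Draw n observations from the sample itself, with replacement.
        boot_sample = rng.choices(sample, k=n)
        estimates.append(estimator(boot_sample))
    return estimates


# Hypothetical data for illustration only: 30 made-up observations.
data_rng = random.Random(1)
data = [data_rng.expovariate(0.1) for _ in range(30)]

# Bootstrapped values of the estimator (here, the sample median).
medians = bootstrap_estimates(data, statistics.median, n_boot=1000, seed=2)

# Their standard deviation is the bootstrap estimate of the standard error.
se_median = statistics.stdev(medians)
print("sample median:", statistics.median(data))
print("bootstrap standard error of the median:", se_median)
```

A histogram of the values in medians would give the graphical estimate of the estimator's distribution; the median is used here only because its standard error has no simple formula.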
This handout
(and the course textbook) is but a brief introduction to bootstrap methods. For
more information, see An Introduction to
the Bootstrap by Efron and Tibshirani
(Chapman & Hall).
Sometimes we
have a good idea of the distribution from which our sample data come, but we
are unsure of the parameters. We can estimate the parameters from our
particular sample. Then rather than re-sampling from our one sample, we
repeatedly sample from the particular distribution (with estimated parameters).
This approach is called the parametric bootstrap, and it is the method mentioned at the end of Section 7.1 in our course
textbook. Although the parametric bootstrap is not used as often in practice (in
most cases we don’t know the distribution from which our data come), it is easy
to perform using Minitab. Hence, in this initial look at the bootstrap, we’ll
use the parametric method. (Later we’ll implement the non-parametric bootstrap
using the statistical package R, within which we can write appropriate
simulation programs.)
As an
introductory example, consider a case where we do actually have theory to guide
us (then we can compare our bootstrap results with the theoretical results).
Suppose we take a random sample of size 100 from a normal population with mean
50 and standard deviation 10. We are interested in the behavior of the sample
mean. Through theory, we know the sample mean, $\bar{x}$, has a normal distribution with mean 50
and standard error $\sigma/\sqrt{n} = 10/\sqrt{100} = 1$. Do we get a similar result using parametric
bootstrap methods?
Using Minitab, first
create your original random sample. From the Calc menu choose Random
Data>
Now suppose
this sample of 100 values is our initial random sample. We believe these data
come from a normal distribution, but pretend we don’t know the mean and
standard deviation. (Note that you can look at a histogram of your data to see how
well your sample “represents” a normal distribution; remember that the
bootstrap method depends heavily on how representative the sample is of the
population.) We can estimate the mean and standard deviation by the
corresponding sample values. From the Stat menu choose Basic
Statistics>Descriptive Statistics, and select the Sample Data column as your variable. Each of us will obtain a
different sample mean and sample standard deviation, and we’ll use these
estimates for the rest of the exercise.
From the Calc menu, again select Random Data>
Theoretical Results Known: Behavior of the Sample Mean
Now each of us
has 500 bootstrap samples of size 100. Label column c102 “Bootstrap Means.”
Then from the Calc menu select Row Statistics, and calculate the means
for c2-c101 (and store the means in the column you just named “Bootstrap
Means”). Now create a histogram and determine descriptive statistics (mean and
standard deviation) for the column of bootstrapped means. Does the behavior of
the bootstrapped means agree with what the Central Limit Theorem tells us
theoretically?
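For readers who want to check the Minitab results another way, the same exercise can be sketched in Python. This is a minimal sketch, assuming the bootstrap samples are drawn from a normal distribution with the estimated mean and standard deviation, as the parametric method describes. The variable names and the seed are hypothetical, and each run with a different seed gives slightly different numbers.

```python
import random
import statistics

rng = random.Random(2024)  # arbitrary seed so the run is repeatable

# The "original" random sample: 100 values from a normal(50, 10) population.
sample_data = [rng.gauss(50, 10) for _ in range(100)]

# Pretend the parameters are unknown and estimate them from the sample.
mean_hat = statistics.mean(sample_data)
sd_hat = statistics.stdev(sample_data)

# 500 parametric bootstrap samples of size 100, each drawn from a
# normal distribution with the estimated mean and standard deviation.
boot_samples = [
    [rng.gauss(mean_hat, sd_hat) for _ in range(100)]
    for _ in range(500)
]

# One mean per bootstrap sample (the analogue of Row Statistics in Minitab).
boot_means = [statistics.mean(s) for s in boot_samples]

print("estimated mean and SD:", mean_hat, sd_hat)
print("mean of bootstrap means:", statistics.mean(boot_means))
print("SD of bootstrap means (estimated standard error):",
      statistics.stdev(boot_means))
```

The bootstrap means should be centered near the estimated mean rather than exactly at 50, and their standard deviation should be close to the estimated standard deviation divided by 10, which is what the theory predicts for samples of size 100.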
Theoretical Results Unknown: Behavior of the Sample Standard Deviation
Now suppose we
are interested in the sample standard deviation as an estimator of the
population standard deviation. We want to know the behavior (distribution,
mean, standard error) of the sample standard deviation. In this case, we don’t
have theory to guide us.
We already have
500 bootstrap samples from our original sample data. Now we simply need to
calculate the sample standard deviation for each of these samples. Label column
c103 “Bootstrap Standard Deviations.” Then determine (using Calc>Row Statistics) the standard deviations for the samples in
columns c2-c101 (store the standard deviations in the column you just labeled).
Finally, create a histogram of these bootstrapped sample standard deviations
(this gives you an estimate of the sampling distribution of the standard
deviation). Also, determine descriptive statistics for the bootstrap standard
deviations. The standard deviation of our bootstrap estimates
tells us how precise our estimator is: it is an estimate of the standard error of the
sample standard deviation.
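The standard-deviation version of the exercise can be sketched the same way in Python. As before, this is an illustrative sketch rather than the handout's own procedure: the variable names and the seed are hypothetical, and the bootstrap samples are assumed to come from a normal distribution with the estimated parameters.

```python
import random
import statistics

rng = random.Random(7)  # arbitrary seed, chosen only for repeatability

# Stand-in for the Sample Data column: 100 draws from normal(50, 10).
sample_data = [rng.gauss(50, 10) for _ in range(100)]
mean_hat = statistics.mean(sample_data)
sd_hat = statistics.stdev(sample_data)

# 500 parametric bootstrap samples of size 100; keep each sample's SD.
boot_sds = []
for _ in range(500):
    boot_sample = [rng.gauss(mean_hat, sd_hat) for _ in range(100)]
    boot_sds.append(statistics.stdev(boot_sample))

# The spread of the bootstrapped SDs estimates the standard error
# of the sample standard deviation.
se_of_sd = statistics.stdev(boot_sds)
print("estimated SD from the original sample:", sd_hat)
print("mean of bootstrap SDs:", statistics.mean(boot_sds))
print("estimated standard error of the sample SD:", se_of_sd)
```

A histogram of boot_sds plays the role of the Minitab histogram of column c103.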