Math 445 Computer Lab: Introduction to Bootstrapping

 

General Idea of Bootstrap Methods

Bootstrap methods are now frequently used in statistical analysis (mainly because computer technology has advanced to meet the needs of computer-intensive statistical methods). Suppose you have a random sample from a population where the parameters and distribution of the population are unknown. You are interested in estimating a parameter, θ, of the population, and you determine an estimator, θ̂, based on the sample data. In order to do inference, it’s important to know the behavior of an estimator (e.g., its distribution and standard error). In some cases, we have theory (e.g., the Central Limit Theorem) to tell us the behavior of an estimator; in other cases, we don’t. What if we don’t have any statistical theory to tell us the behavior of θ̂?

 

We can treat our sample of data as our whole population (note this assumes the sample is a good representation of the population). Then we can resample (with replacement) from our original sample; these sets of re-sampled data are called bootstrap samples. For each of the bootstrap samples, we can calculate the value of θ̂. Based on these bootstrapped values of θ̂, we can graphically get an estimate of the distribution of θ̂ (for example, with a histogram), and, by calculating the standard deviation of the bootstrapped θ̂ values, we can get an estimate of the standard error of θ̂.
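
The resampling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the lab (which uses Minitab): the data values, the function name bootstrap_values, and the choice of the median as the estimator are all hypothetical.

```python
import random
import statistics


def bootstrap_values(sample, estimator, n_boot=1000, seed=None):
    """Return the estimator computed on n_boot resamples of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    values = []
    for _ in range(n_boot):
        # Draw n observations from the original sample, with replacement.
        resample = rng.choices(sample, k=n)
        values.append(estimator(resample))
    return values


# Hypothetical data, used only for illustration.
data = [12.1, 9.8, 11.4, 10.3, 13.7, 8.9, 10.8, 12.5, 9.4, 11.0]
boot_medians = bootstrap_values(data, statistics.median, n_boot=1000, seed=1)

# The standard deviation of the bootstrapped values estimates the standard error.
se_estimate = statistics.stdev(boot_medians)
print(f"bootstrap estimate of the standard error of the median: {se_estimate:.3f}")
```

A histogram of boot_medians would play the role of the graphical estimate of the estimator's distribution.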

 

Hence, we don’t know the distribution of the population, but we “pull ourselves up by our bootstraps,” and use our sample of data to represent the population. Then we simulate repeated sampling from the population by repeatedly sampling (with replacement) from the sample.

 

This handout (and the course textbook) is but a brief introduction to bootstrap methods. For more information, see An Introduction to the Bootstrap by Efron and Tibshirani (Chapman & Hall).

 

 

Parametric Bootstrap

Sometimes we have a good idea of the distribution from which our sample data come, but we are unsure of the parameters. We can estimate the parameters from our particular sample. Then rather than re-sampling from our one sample, we repeatedly sample from the particular distribution (with estimated parameters). This is the bootstrap method mentioned at the end of Section 7.1 in our course textbook. Although the parametric bootstrap is not used as often in practice (in most cases we don’t know the distribution from which our data come), it is easy to perform using Minitab. Hence, in this initial look at the bootstrap, we’ll use the parametric method. (Later we’ll implement the non-parametric bootstrap using the statistical package R, within which we can write appropriate simulation programs.)

 

 

Parametric Bootstrap Examples (When Theoretical Results Are Known and When They Are Unknown)

As an introductory example, consider a case where we do actually have theory to guide us (then we can compare our bootstrap results with the theoretical results). Suppose we take a random sample of size 100 from a normal population with mean 50 and standard deviation 10. We are interested in the behavior of the sample mean. Through theory, we know the sample mean, x̄, has a normal distribution with mean 50 and standard error 10/√100 = 1. Do we get a similar result using parametric bootstrap methods?
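
As a worked check of the theoretical result for this example, with population mean μ = 50, population standard deviation σ = 10, and sample size n = 100 (the symbols μ, σ, and n are notation introduced here):

```latex
\[
\bar{x} \sim N\!\left(\mu,\ \frac{\sigma^{2}}{n}\right) = N(50,\ 1),
\qquad
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{100}} = 1 .
\]
```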

 

Using Minitab, first create your original random sample. From the Calc menu choose Random Data>Normal. Generate 100 rows of data (mean 50 and standard deviation 10) and store them in the first column. Label this column “Sample Data.”

 

Now suppose this sample of 100 values is our initial random sample. We believe these data come from a normal distribution, but pretend we don’t know the mean and standard deviation. (Note: you can look at a histogram of your data to see how well your sample “represents” a normal distribution. Remember that the bootstrap method depends heavily on how representative the sample is of the population.) We can estimate the mean and standard deviation from the respective sample values. From the Stat menu choose Basic Statistics>Descriptive Statistics, and select the Sample Data column as your variable. Each of us will obtain a different sample mean and sample standard deviation, and we’ll use these estimates for the rest of the exercise.
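
For readers who want to see these first two steps outside Minitab, here is a minimal Python sketch. It is an illustration only: the variable names and the seed are arbitrary assumptions, and the values it produces will not match your Minitab sample.

```python
import random
import statistics

rng = random.Random(445)  # arbitrary seed so the run is reproducible

# Original random sample: 100 draws from a normal population (mean 50, sd 10).
sample_data = [rng.gauss(50, 10) for _ in range(100)]

# Pretend the population parameters are unknown and estimate them from the sample.
sample_mean = statistics.mean(sample_data)
sample_sd = statistics.stdev(sample_data)  # sample sd, with n - 1 in the denominator

print(f"sample mean = {sample_mean:.2f}, sample sd = {sample_sd:.2f}")
```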

 

From the Calc menu, again select Random Data>Normal. Generate 500 rows of data (this is the number of bootstrap samples) and store these data in columns c2-c101 (each row is a bootstrap sample of size 100). Use your specific sample mean and standard deviation values.
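
The same generation step might look like this in Python. This is a sketch under assumptions: est_mean and est_sd are hypothetical stand-ins for your own sample estimates, and the list-of-rows layout simply mirrors the worksheet (500 rows by 100 columns).

```python
import random

# Hypothetical estimates from an original sample; substitute your own values.
est_mean = 49.3
est_sd = 10.4

n_boot = 500  # number of bootstrap samples (worksheet rows)
n = 100       # size of each bootstrap sample (worksheet columns c2-c101)

rng = random.Random(2)  # arbitrary seed so the run is reproducible

# Each inner list is one bootstrap sample drawn from the fitted normal distribution.
bootstrap_samples = [
    [rng.gauss(est_mean, est_sd) for _ in range(n)]
    for _ in range(n_boot)
]

print(len(bootstrap_samples), len(bootstrap_samples[0]))  # prints: 500 100
```

Note that the draws come from the fitted distribution, not from the original 100 values; that is what makes this bootstrap parametric.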

 

Theoretical Results Known—Behavior of the Sample Mean

Now each of us has 500 bootstrap samples of size 100. Label column c102 “Bootstrap Means.” Then from the Calc menu select Row Statistics, and calculate the means for c2-c101 (and store the means in the column you just named “Bootstrap Means”). Now create a histogram and determine descriptive statistics (mean and standard deviation) for the column of bootstrapped means. Does the behavior of the bootstrapped means agree with what the Central Limit Theorem tells us theoretically? (Keep in mind that the bootstrapped means will be centered near your own sample mean rather than exactly at 50, because the bootstrap samples were generated using your estimates.)
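
A rough Python version of this comparison is sketched below. It is an illustration, not the lab procedure: est_mean and est_sd are hypothetical estimates, and the seed is arbitrary.

```python
import math
import random
import statistics

# Hypothetical estimates from the original sample; substitute your own values.
est_mean, est_sd = 49.3, 10.4
n_boot, n = 500, 100

rng = random.Random(3)  # arbitrary seed so the run is reproducible

# One mean per bootstrap sample (the "Row Statistics" step).
bootstrap_means = []
for _ in range(n_boot):
    row = [rng.gauss(est_mean, est_sd) for _ in range(n)]
    bootstrap_means.append(statistics.mean(row))

center = statistics.mean(bootstrap_means)
spread = statistics.stdev(bootstrap_means)
theory_se = est_sd / math.sqrt(n)  # sd / sqrt(n), using the estimated sd

print(f"mean of bootstrap means: {center:.2f} (estimate used: {est_mean})")
print(f"sd of bootstrap means:   {spread:.3f} (sd/sqrt(n): {theory_se:.3f})")
```

If the bootstrap is behaving as theory suggests, the two numbers on each printed line should be close to each other.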

 

Theoretical Results Unknown—Behavior of the Sample Standard Deviation

Now suppose we are interested in the sample standard deviation as an estimator of the population standard deviation. We want to know the behavior (distribution, mean, standard error) of the sample standard deviation. In this case, we don’t have a result like the Central Limit Theorem to guide us.

 

We already have 500 bootstrap samples, generated from the normal distribution fitted to our original sample data. Now we simply need to calculate the sample standard deviation for each of these samples. Label column c103 “Bootstrap Standard Deviations.” Then determine (using Calc>Row Statistics) the standard deviations for the samples in columns c2-c101 (store the standard deviations in the column you just labeled). Be sure not to include c1, which holds the original sample, or c102, which holds the bootstrap means. Finally, create a histogram of these bootstrapped sample standard deviations (this gives you an estimate of the sampling distribution of the standard deviation). Also, determine descriptive statistics for the bootstrap standard deviations. The standard deviation of our bootstrap estimates tells us how precise our estimator is: it is an estimate of the standard error of the sample standard deviation.
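
The same calculation can be sketched in Python. As before, this is only an illustration: est_mean and est_sd are hypothetical estimates, the seed is arbitrary, and the numbers will differ from your Minitab results.

```python
import random
import statistics

# Hypothetical estimates from the original sample; substitute your own values.
est_mean, est_sd = 49.3, 10.4
n_boot, n = 500, 100

rng = random.Random(4)  # arbitrary seed so the run is reproducible

# One sample standard deviation per bootstrap sample.
bootstrap_sds = []
for _ in range(n_boot):
    row = [rng.gauss(est_mean, est_sd) for _ in range(n)]
    bootstrap_sds.append(statistics.stdev(row))

mean_of_sds = statistics.mean(bootstrap_sds)
# The spread of the bootstrapped sds estimates the standard error of the sample sd.
se_of_sd = statistics.stdev(bootstrap_sds)

print(f"mean of bootstrap sds: {mean_of_sds:.2f}")
print(f"estimated standard error of the sample sd: {se_of_sd:.3f}")
```

Only the row statistic changed relative to the sample-mean case; the bootstrap samples are generated in exactly the same way.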