Math 445—Bootstrap Example

 

Suppose we are interested in the annual incomes of adults in the Fox Cities. Because we don’t have the time and money to take an accurate census, we simply take a random sample of 15 adults and record their incomes (in thousands of dollars). For this sample, the income distribution and numerical summaries are shown below.

 

 

Variable           N    Mean   StDev  Minimum    Q1  Median     Q3  Maximum
Income ($1000s)   15   238.6   351.3      0.0  25.0    82.0  400.0   1200.0

 

 

Inference about the Mean

Suppose we want to estimate the average annual income for all adults in the Fox Cities. We can use our sample to create a confidence interval for the population mean. Since the population standard deviation is unknown, we need to use a one-sample t confidence interval. Recall, though, that this t-interval procedure includes the condition that the population being sampled is normal. It’s clear from the histogram of our sample data that the distribution of annual incomes is, in fact, not normal but very skewed, and with only 15 observations we cannot count on the central limit theorem to make up for that. Hence, we shouldn’t use the t interval.

 

Bootstrap to the rescue! We can treat this sample of 15 incomes as our population, and resample (n=15) from it with replacement. The histogram below shows 1000 bootstrap mean values. This histogram gives us an estimate of the sampling distribution of the sample mean income (based on samples of size 15). Note that this distribution is not normal. Now our confidence interval can’t take on the standard form of estimate ± multiplier × (standard error), because we don’t know what the multiplier is (it doesn’t come from a z table or a t table). Hence, we use the 2.5th percentile and the 97.5th percentile of the bootstrapped means as the endpoints of our 95% confidence interval (because we have nothing else to go on, we “pull ourselves up by our bootstraps”).
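The resampling procedure just described can be sketched in a few lines of Python. This is only an illustration: the handout does not list the 15 raw incomes, so the `incomes` list below is made up (chosen to agree with the five-number summary in the table), and the function names, the seed, and the use of the standard library are all assumptions. Its interval will not reproduce the one reported in this handout.

```python
import random
import statistics

# HYPOTHETICAL placeholder incomes (in $1000s). These are NOT the handout's
# data, whose raw values are not listed; substitute the real 15 observations.
incomes = [0, 8, 15, 25, 31, 40, 55, 82, 110, 150, 240, 400, 520, 700, 1200]

def bootstrap_means(sample, n_boot=1000, seed=445):
    """Return n_boot means, each from a resample of size len(sample)
    drawn from the sample with replacement."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(n_boot)]

def percentile_interval(values, level=0.95):
    """Percentile interval: trim (1 - level)/2 of the values from each tail."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    lo_index = round(tail * len(ordered))
    hi_index = round((1 - tail) * len(ordered)) - 1
    return ordered[lo_index], ordered[hi_index]

boot_means = bootstrap_means(incomes)
low, high = percentile_interval(boot_means)
print(f"95% bootstrap percentile interval for the mean: ({low:.1f}, {high:.1f})")
```

A histogram of `boot_means` plays the role of the estimated sampling distribution; the interval endpoints are simply the values that cut off the lowest and highest 2.5% of those 1000 means.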

 

 

For this particular bootstrap simulation the 95% confidence interval is ($96,500, $426,100). If we had (inappropriately) used the t confidence interval, our 95% interval would be ($44,000, $433,200). Notice that the lower endpoints of the two intervals are quite different. Also note that both intervals are very wide (perhaps too wide to be practically useful). When the bootstrap and the t intervals disagree substantially, this typically means the parametric conditions of the t methods are not met. This then also means we cannot trust the confidence level of the t interval.
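For comparison, the t interval quoted above can be checked directly from the summary table. The sketch below takes the sample size, mean, and standard deviation from that table; the critical value 2.145 (95% confidence, 14 degrees of freedom) is read from a standard t table, and the variable names are illustrative.

```python
import math

n = 15
mean = 238.6     # sample mean, in $1000s (from the summary table)
stdev = 351.3    # sample standard deviation, in $1000s (from the summary table)
t_star = 2.145   # t critical value for 95% confidence with n - 1 = 14 df

std_error = stdev / math.sqrt(n)   # estimated standard error of the mean
margin = t_star * std_error        # margin of error
t_low, t_high = mean - margin, mean + margin

print(f"t interval: ({t_low:.1f}, {t_high:.1f})")  # prints: t interval: (44.0, 433.2)
```

In thousands of dollars this is (44.0, 433.2), which matches the ($44,000, $433,200) interval given in the text.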

Inference about the Median

It’s clear that the distribution of annual incomes is skewed toward the high values (this is a typical shape of salary distributions). Hence, the median, rather than the mean, is a better measure of typical salary. (This is a simple observation that is often overlooked in analyses.)

 

Suppose we want to estimate the median annual income of all Fox Cities adults. Now we have no theory at all to guide us. We don’t have a central limit theorem for medians. Hence, the bootstrap (or some other nonparametric approach) is our only option. Included below is a histogram of 1000 bootstrapped median incomes. Notice the estimated sampling distribution of the sample median is not at all normal.

 

 

We can use the standard deviation of these 1000 bootstrapped medians to estimate the standard error of the sample median: $66,800 (this gives us an idea of the precision of our estimator). Using the bootstrap percentile method, we can create a 95% confidence interval for the median annual income of all Fox Cities adults: ($25,000, $400,000). The bootstrap method allows us to construct a confidence interval even though we have no theory to guide us. Unfortunately, this interval is quite wide (perhaps too wide to be practically helpful?).
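The same recipe works for the median: resample, compute the median of each resample, then summarize the 1000 results. The sketch below is a minimal illustration using only the standard library. The `incomes` list is made-up placeholder data (the handout's raw values are not listed), and the helper name `bootstrap` and the seed are assumptions, so the numbers it prints will differ from those reported above.

```python
import random
import statistics

# HYPOTHETICAL placeholder incomes (in $1000s); NOT the handout's actual data.
incomes = [0, 8, 15, 25, 31, 40, 55, 82, 110, 150, 240, 400, 520, 700, 1200]

def bootstrap(sample, statistic, n_boot=1000, seed=445):
    """Apply `statistic` to n_boot resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]

boot_medians = bootstrap(incomes, statistics.median)

# Bootstrap standard error: the standard deviation of the bootstrapped medians.
se_median = statistics.stdev(boot_medians)

# Percentile interval: the values cutting off the lowest and highest 2.5%.
ordered = sorted(boot_medians)
ci_low = ordered[round(0.025 * len(ordered))]
ci_high = ordered[round(0.975 * len(ordered)) - 1]

print(f"bootstrap SE of the median: {se_median:.1f}")
print(f"95% percentile interval for the median: ({ci_low}, {ci_high})")
print(f"distinct bootstrapped medians: {len(set(boot_medians))}")
```

Because each resample has 15 values (an odd number), its median is always one of the original observations. The bootstrapped medians can therefore take only a handful of distinct values, which is one reason their histogram looks lumpy rather than normal and why the interval endpoints are themselves observed incomes.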

 

Inference about the First Quartile

Suppose now we’re most interested in the first quartile (25th percentile) of the annual incomes of all Fox Cities adults (or any other percentile, for that matter). Again, we have no theory to guide us. We have no idea what the sampling distribution of the sample first quartile looks like (there’s no central limit theorem for sample first quartiles).

 

Included below is a histogram of 1000 bootstrapped first quartile incomes. This histogram gives us an estimate of the sampling distribution of the sample first quartile. We can use the standard deviation of these 1000 bootstrapped first quartiles to estimate the standard error of the sample first quartile: $17,200 (this gives us a sense of the precision of our estimator). We can also use these values to create a bootstrap (percentile) 95% confidence interval for the first quartile income of all Fox Cities adults: ($12,000, $82,000).
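A sketch of this last calculation follows, again in standard-library Python. Two things here are assumptions rather than part of the handout: the `incomes` list is made-up placeholder data, and the quartile rule is Python's `"exclusive"` method, which puts the first quartile at position (n + 1)/4 in the sorted data. Software packages differ in how they define sample quartiles, and the handout does not say which rule was used.

```python
import random
import statistics

# HYPOTHETICAL placeholder incomes (in $1000s); NOT the handout's actual data.
incomes = [0, 8, 15, 25, 31, 40, 55, 82, 110, 150, 240, 400, 520, 700, 1200]

def first_quartile(values):
    """Sample first quartile using the (n + 1)/4 position rule (assumed)."""
    return statistics.quantiles(values, n=4, method="exclusive")[0]

rng = random.Random(445)  # fixed seed so the run is repeatable
boot_q1 = [
    first_quartile(rng.choices(incomes, k=len(incomes)))  # resample with replacement
    for _ in range(1000)
]

# Bootstrap standard error and 95% percentile interval for the first quartile.
se_q1 = statistics.stdev(boot_q1)
ordered = sorted(boot_q1)
q1_low = ordered[round(0.025 * len(ordered))]
q1_high = ordered[round(0.975 * len(ordered)) - 1]

print(f"sample first quartile: {first_quartile(incomes)}")
print(f"bootstrap SE of the first quartile: {se_q1:.1f}")
print(f"95% percentile interval: ({q1_low}, {q1_high})")
```

Swapping `first_quartile` for any other percentile function gives the corresponding bootstrap standard error and interval with no other changes.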