Math 217 Computer Lab: Introduction to R and Bootstrap Confidence Intervals

 

In order to do simulations and bootstrapping we will use R, which is a language and environment for statistical computing and graphics. R is installed on all the computers in Briggs 421. Since R is freeware, you may download your own copy if you have a personal computer (http://cran.us.r-project.org/).

 

Open R through the start-up menu on your computer (Class Programs>R). R has an interface where you type in commands, and the computer returns responses or creates graphics. Using R requires more work than using a menu-driven statistical package, but it also gives you precise control over the calculations being performed.

 

Basics

R can be used as a basic calculator. For example, try:

> 6*9+1 <ENTER>

[1] 55

> 5^4 <ENTER>

[1] 625

> (2^4+5)/30 <ENTER>

[1] 0.7

 

If you strike the <ENTER> key before typing a complete expression, you see the continuation prompt, the

plus sign (+). For example, suppose you wish to calculate 3 + 2 * (8 – 4), but you accidentally strike the

<ENTER> key too early:

> 3+2*(8 <ENTER>

+

Finish the expression by typing -4) after the +

+ - 4) <ENTER>

[1] 11

 

R has many useful mathematical functions built-in (e.g., exp, sqrt, log, sin, cos). Try:

> sqrt(25) <ENTER>

[1] 5

 

Let’s enter in some data. Listed below are annual salaries (in thousands of dollars) for a random sample of 30 CEOs of small US business firms.

 

   1103    282    350    350    406    155    536    726    736    291    808

    332    217    339    291    250     21    573    362    242    343    204

    198    862    370    750    262    390    250    388

 

You can create a vector of the data using the c function and the assignment operator (<-):

> CEOsalary <- c(1103, 282, 350, 350,…,388)

 

To see what is in the variable CEOsalary, just type the name of the variable:

> CEOsalary

 [1] 1103  282  350  350  406  155  536  726  736  291  808  332  217  339  291

[16]  250   21  573  362  242  343  204  198  862  370  750  262  390  250  388

 

To find the sample average salary use the mean function:

> mean(CEOsalary)

[1] 412.9

 

To find the standard deviation of the CEO salaries use the sd function:

> sd(CEOsalary)

[1] 243.6439

 

To create a histogram of the CEO salaries use the hist function:

> hist(CEOsalary)

 

Notice that the graph isn’t presentation quality (not descriptively titled or labeled on the x-axis). To learn more about the hist function use the help command:

> help(hist)

 

Now create a better documented histogram:

> hist(CEOsalary, main="Annual Salaries for 30 CEOs of Small US Firms",xlab="Annual Salary (in thousands of dollars)")

 

The arrow keys on the keyboard can be used to recall previous commands. The vertical arrow keys can be used to scroll forward and backward through command history. (Try this.) Once a command is located in this way, the cursor can be moved within the command using the horizontal arrow keys, and characters can be removed with the delete key or added with the other keys. This can sometimes save you the time of repeatedly typing similar commands.

 

The first number in a vector has index 1, the second has index 2, and so on. To pick one of the CEO salaries, you can specify its index using square brackets:

> CEOsalary[3]

[1] 350

 

Suppose we want to obtain the 97.5th percentile for a t-distribution with 25 degrees of freedom. First, get help for the t-distribution:

> help(rt)

 

The help indicates that we should use the quantile function:

> qt(.975,25)

[1] 2.059539

 

 

Other Notes

·            In general, R is space-insensitive.

> 3 +4

> 3+ 4

> mean(3+5)

> mean ( 3 + 5 )

 

BUT, the assignment operator must not have spaces (<- is different from < -).

 

·            R has extensive help documentation. From the Help menu you can view the FAQ (Frequently Asked Questions) on R. Furthermore, you can access the help manuals, including “An Introduction to R,” which is an excellent introductory guide.

 

·            To see the list of all objects in your workspace:

> ls()

[1] "bootmeans1000"   "bootmedians1000" "bootstrapmean"   "bootstrapmedian"

[5] "CEOsalary"       "x"   

 

·            Note that the vector x is still included in your workspace. You no longer need these data, so it is a good idea to remove the vector (keeping your workspace as clean as possible):

> rm(x)

> ls()

[1] "bootmeans1000"   "bootmedians1000" "bootstrapmean"   "bootstrapmedian"

[5] "CEOsalary"

 

·            To quit, type

> q()

 


Bootstrap Confidence Interval for the Mean (and Comparison to Theory)

As an introductory example, we’ll consider a case where we do actually have theory to guide us (then we can compare our bootstrap results with the theoretical results). Suppose we take a random sample of size 100 from a normal population with mean 50 and standard deviation 10. We are interested in the behavior of the sample mean. Through theory, we know the sample mean will follow a normal distribution with mean 50 and standard error . We’ll also use bootstrap methods to investigate the behavior of the sample mean.

 

 

First, create your sample of size 100:

> x <- rnorm(100, mean=50, sd=10)

 

This is your original sample of data. (You can create a histogram of the sample, using hist(x), to see if it’s a good representation of the population).

 

Now you will write a function to generate B bootstrap samples, calculating and saving the mean of each bootstrap sample. First investigate the sample function in R (we’ll need this as part of our bootstrap program):

> help(sample)

 

Now name your function:

> bootstrapmean <- function(){}

 

Then edit your function through the editor (if you type in your function directly in the prompt menu, you are not allowed to correct mistakes on previous lines, and you can easily lose work):

> bootstrapmean <- edit(bootstrapmean)

 

Now type in the following function, exactly as shown below. (You need not type in my comments, which follow the pound signs; I included them so you’d know what the program is doing.)

 

function(B){            # B is the number of bootstrap samples

bootmeans <- NULL       # initializing the vector “bootmeans

for (i in 1:B){         # this loop does the same thing for each of the B bootstrap samples

  bootsamp <- sample(x, 100, replace=TRUE)  # sample 100 values with replacement from the vector x

  bootmeans[i] <- mean(bootsamp) # calculate the mean of the bootstrap sample and store it in bootmeans

  }

bootmeans               # return the vector of bootstrapped means

}

 

When you are finished, simply exit out of the editor. The program will then ask if you want to save changes; say “yes” to this. Important note: if there is a syntax error in your program, R will let you know; in order to get back to edit your program, you must type bootstrapmean <- edit() at the prompt (otherwise you may lose your work).

 

Notice this function is specific to sampling 100 values from your specific x vector (i.e., x and 100 are “hard coded” into the function). You could also write a function that takes more things as arguments (number of bootstrap samples, vector of values, and sample size for each bootstrap sample).

 

Now run the function using B = 1000 (i.e., 1000 bootstrap samples—this is the number suggested for confidence intervals). Be sure to save the results somewhere, otherwise R will simply output the results and you won’t have the ability to manipulate them:

> bootmeans1000 <- bootstrap(1000)

 

Create a histogram of the 1000 bootstrapped means (hist(bootmeans1000)) and calculate the standard deviation of the 1000 bootstrapped means (sd(bootmeans1000)).

 

 

Based on theory, we can create a 95% confidence interval for the population mean:

 

For my original sample, the sample mean is

> mean(x)

[1] 50.78915

# Note that your mean will be different, as you have a different sample

 

Hence, my confidence interval is about 48.83 to 52.75:

> mean(x)-1.96

[1] 48.82915

> mean(x)+1.96

[1] 52.74915

 

You can use the percentiles of your bootstrapped means to create a confidence interval for the population mean. That is, the lower endpoint of the confidence interval will be the 2.5th percentile of the bootstrapped means and the upper endpoint will be the 97.5th percentile of the bootstrapped means. First investigate the quantile function in R:

> help(quantile)

 

Then determine your specific bootstrap confidence interval for the median salary. In my case:

> quantile(bootmeans1000, c(0.025,.975))

 

    2.5%    97.5%

48.95161 52.80932

 

Note that the bootstrap confidence interval is very close to the theoretical confidence interval.

 

Bootstrap Confidence Interval for the Median

Now let’s reconsider the CEO salaries. Recall that the histogram of salaries was skewed toward positive values. Hence, the mean is perhaps not the best measure of center or typical value. Instead, we should use the median.

 

Suppose we want to estimate the median annual salary of all CEOs of small firms in the US. We can do this estimation with a confidence interval, but we don’t have theory to guide us. So we can determine a bootstrap confidence interval.

 

Write a program to create B bootstrapped medians. First name your function:

> bootstrapmedian <- function(){}

 

Then edit your function through the editor:

> bootstrapmedian <- edit(bootstrapmedian)

 

Now type in the following function (or create your own fucntion that doesn’t “hard code” the CEOsalary vector and sample size of 30):

 

function(B){

bootmedians <- NULL

for (i in 1:B){

  bootsamp <- sample(CEOsalary, 30, replace=TRUE)

  bootmedians[i] <- median(bootsamp)

  }

bootmedians

}

 

Run the program for B = 1000 (and save the results somewhere). Look at a histogram of the estimates. Furthermore, determine the standard deviation of the estimates (our estimate of the standard error of the median) and create a 95% bootstrap confidence interval for the population median. (Note: look back in the handout if you’ve forgotten how to run a program, create a histogram, calculate a standard deviation, or create a bootstrap confidence interval).