In order to do
simulations and bootstrapping we will use R, which is a language and
environment for statistical computing and graphics. R is installed on all the
computers in Briggs 421. Since R is freeware, you may download your own copy if
you have a personal computer (http://cran.us.r-project.org/).
Open R through the
start-up menu on your computer (Class Programs>R). R has an interface where
you type in commands, and the computer returns responses or creates graphics. Using R requires more work than using a
menu-driven statistical package, but it also gives you precise control over the
calculations being performed.
Basics
R can be used as a basic calculator. For example, try:
>
6*9+1 <ENTER>
[1]
55
>
5^4 <ENTER>
[1]
625
>
(2^4+5)/30 <ENTER>
[1]
0.7
If you strike the <ENTER>
key before typing a complete expression, you see the continuation prompt,
the
plus sign (+). For example, suppose you wish to
calculate 3 + 2 * (8 – 4), but you accidentally strike the
<ENTER> key too early:
>
3+2*(8 <ENTER>
+
Finish the expression by
typing -4) after the +
+ -
4) <ENTER>
[1]
11
R has many useful mathematical functions built-in
(e.g., exp, sqrt, log, sin, cos). Try:
>
sqrt(25)
<ENTER>
[1]
5
Let’s enter in some data.
Listed below are annual salaries (in thousands of dollars) for a random sample
of 30 CEOs of small
1103
282 350 350 406
155 536 726
736 291 808
332
217 339 291 250
21 573 362
242 343 204
198
862 370 750
262 390 250
388
You can create a vector of the data using the c
function and the assignment operator (<-):
>
CEOsalary <- c(1103, 282,
350, 350,…,388)
To see what is in the
variable CEOsalary, just type the name of the
variable:
>
CEOsalary
[1] 1103
282 350 350 406
155 536 726
736 291 808
332 217 339
291
[16] 250
21 573 362
242 343 204
198 862 370
750 262 390
250 388
To find the sample average salary use the mean
function:
>
mean(CEOsalary)
[1]
412.9
To find the standard deviation of the CEO salaries
use the sd
function:
>
sd(CEOsalary)
[1]
243.6439
To create a histogram of the CEO salaries use the hist function:
>
hist(CEOsalary)
Notice that the graph
isn’t presentation quality (not descriptively titled or labeled on the x-axis).
To learn more about the hist function use the help command:
>
help(hist)
Now create a better documented
histogram:
>
hist(CEOsalary, main="Annual Salaries for 30 CEOs of Small
US Firms",xlab="Annual Salary (in thousands
of dollars)")
The arrow keys on the
keyboard can be used to recall previous
commands. The vertical arrow keys can be used to scroll forward and
backward through command history. (Try this.) Once a command is located in this
way, the cursor can be moved within the command using the horizontal arrow
keys, and characters can be removed with the delete key or added with the other
keys. This can sometimes save you the time of repeatedly typing similar
commands.
The first number in a
vector has index 1, the second has index 2, and so on. To pick one of the CEO
salaries, you can specify its index using square brackets:
>
CEOsalary[3]
[1]
350
Suppose we want to obtain
the 97.5th percentile for a t-distribution
with 25 degrees of freedom. First, get help for the t-distribution:
>
help(rt)
The help indicates that
we should use the quantile function:
>
qt(.975,25)
[1]
2.059539
Other Notes
·
In general, R is space-insensitive.
>
3 +4
>
3+ 4
>
mean(3+5)
>
mean ( 3 + 5 )
BUT,
the assignment operator must not have spaces (<- is different from <
-).
·
R has extensive help documentation. From the Help menu
you can view the FAQ (Frequently Asked Questions) on R. Furthermore, you can
access the help manuals, including “An Introduction to R,” which is an
excellent introductory guide.
·
To see the
list of all objects in your workspace:
>
ls()
[1]
"bootmeans1000"
"bootmedians1000" "bootstrapmean" "bootstrapmedian"
[5]
"CEOsalary" "x"
·
Note that the
vector x is still
included in your workspace. You no longer need these data, so it is a good idea
to remove the vector (keeping your workspace as clean as possible):
>
rm(x)
>
ls()
[1]
"bootmeans1000"
"bootmedians1000" "bootstrapmean" "bootstrapmedian"
[5]
"CEOsalary"
·
To quit, type
>
q()
As an introductory
example, we’ll consider a case where we do actually have theory to guide us
(then we can compare our bootstrap results with the theoretical results).
Suppose we take a random sample of size 100 from a normal population with mean
50 and standard deviation 10. We are interested in the behavior of the sample
mean. Through theory, we know the sample mean will follow a normal distribution
with mean 50 and standard error
. We’ll also use bootstrap methods to investigate the
behavior of the sample mean.
First, create your sample
of size 100:
>
x <- rnorm(100, mean=50, sd=10)
This is your original
sample of data. (You can create a histogram of the sample, using hist(x), to see if it’s a good representation of the
population).
Now you will write a
function to generate B bootstrap
samples, calculating and saving the mean of each bootstrap sample. First investigate the sample
function in R (we’ll need this as part of our bootstrap program):
>
help(sample)
Now name your function:
> bootstrapmean
<- function(){}
Then edit your function
through the editor (if you type in your function directly in the prompt menu,
you are not allowed to correct mistakes on previous lines, and you can easily
lose work):
> bootstrapmean
<- edit(bootstrapmean)
Now
type in the following function, exactly as shown below. (You need not type in my comments, which follow
the pound signs; I included them so you’d know what the program is doing.)
function(B){ # B is the number of bootstrap
samples
bootmeans <- NULL # initializing the vector “bootmeans”
for
(i in 1:B){ # this loop does the same
thing for each of the B bootstrap samples
bootsamp
<- sample(x, 100, replace=TRUE) # sample 100 values with
replacement from the vector x
bootmeans[i] <- mean(bootsamp)
# calculate the
mean of the bootstrap sample and store it in bootmeans
}
bootmeans #
return the vector of bootstrapped means
}
When you are finished,
simply exit out of the editor. The program will then ask
if you want to save changes; say “yes” to this. Important note: if there is a syntax error in your program, R will
let you know; in order to get back to edit your program, you must type bootstrapmean <- edit() at the prompt (otherwise you may lose your work).
Notice this function is
specific to sampling 100 values from your specific x vector (i.e., x and 100 are “hard coded” into
the function). You could also write a function that takes more things as
arguments (number of bootstrap samples, vector of values, and sample size for
each bootstrap sample).
Now run the function
using B = 1000 (i.e., 1000 bootstrap samples—this is the number suggested for
confidence intervals). Be sure to save the results somewhere, otherwise R will
simply output the results and you won’t have the ability to manipulate them:
> bootmeans1000 <- bootstrap(1000)
Create a histogram of the
1000 bootstrapped means (hist(bootmeans1000)) and calculate the standard deviation of the 1000
bootstrapped means (sd(bootmeans1000)).
Based on theory, we can
create a 95% confidence interval for the population mean: ![]()
For my original sample,
the sample mean is
>
mean(x)
[1]
50.78915
#
Note that your mean will be different, as you have a different sample
Hence, my confidence
interval is about 48.83 to 52.75:
>
mean(x)-1.96
[1]
48.82915
>
mean(x)+1.96
[1]
52.74915
You can use the
percentiles of your bootstrapped means to create a confidence interval for the
population mean. That is, the lower endpoint of the confidence interval will be
the 2.5th percentile of the bootstrapped means and the upper
endpoint will be the 97.5th percentile of the bootstrapped means. First
investigate the quantile function in R:
>
help(quantile)
Then determine your
specific bootstrap confidence interval for the median salary. In my case:
>
quantile(bootmeans1000,
c(0.025,.975))
2.5%
97.5%
48.95161
52.80932
Note that the bootstrap
confidence interval is very close to the theoretical confidence interval.
Bootstrap Confidence Interval for the Median
Now let’s reconsider the
CEO salaries. Recall that the histogram of salaries was skewed toward positive
values. Hence, the mean is perhaps not the best measure of center or typical
value. Instead, we should use the median.
Suppose we want to
estimate the median annual salary of all
CEOs of small firms in the
Write a program to create
B bootstrapped medians. First name
your function:
> bootstrapmedian
<- function(){}
Then edit your function
through the editor:
> bootstrapmedian
<- edit(bootstrapmedian)
Now type in the following
function (or create your own fucntion that doesn’t
“hard code” the CEOsalary vector and sample size of
30):
function(B){
bootmedians <- NULL
for
(i in 1:B){
bootsamp
<- sample(CEOsalary, 30, replace=TRUE)
bootmedians[i] <- median(bootsamp)
}
bootmedians
}
Run the program for B = 1000 (and save the results
somewhere). Look at a histogram of the estimates. Furthermore, determine the
standard deviation of the estimates (our estimate of the standard error of the
median) and create a 95% bootstrap confidence interval for the population
median. (Note: look back in the
handout if you’ve forgotten how to run a program, create a histogram, calculate
a standard deviation, or create a bootstrap confidence interval).