In order to do
simulations and bootrapping we will use R, which is a
language and environment for statistical computing and graphics. R is installed
on all the computers in Briggs 421. Since R is freeware, you may download your
own copy if you have a personal computer (http://cran.us.r-project.org/).
Open R through
the start-up menu on your computer (Class Programs>R). R has an interface
where you type in commands, and the computer returns responses or creates
graphics. Using R requires more work than
using a menu-driven statistical package, but it also gives you precise control
over the calculations being performed.
Basics
R can be used
as a basic calculator. For example,
try:
>
6*9+1 <ENTER>
[1]
55
>
5^4 <ENTER>
[1]
625
>
(2^4+5)/30 <ENTER>
[1]
0.7
If you strike
the <ENTER> key before typing a complete expression, you
see the continuation prompt, the
plus sign (+).
For example, suppose you wish to calculate 3 + 2 * (8 – 4), but
you accidentally strike the
<ENTER> key too early:
>
3+2*(8 <ENTER>
+
Finish the
expression by typing -4) after the +
+ -
4) <ENTER>
[1]
11
R has many useful mathematical functions built-in
(e.g., exp, sqrt, log,
sin, cos).
Try:
>
sqrt(25) <ENTER>
[1]
5
Let’s enter
some data. Listed below are annual salaries (in thousands of dollars) for a
random sample of 30 CEOs of small
1103
282 350 350 406
155 536 726
736 291 808
332
217 339 291
250 21 573
362 242 343
204
198
862 370 750
262 390 250
388
You can create a vector of the data using the c
function and the assignment operator (<-):
>
CEOsalary <- c(1103, 282, 350, 350,…,388)
To see what is
in the variable CEOsalary, just type the name of the
variable:
>
CEOsalary
[1] 1103
282 350 350 406
155 536 726
736 291 808
332 217 339
291
[16] 250
21 573 362
242 343 204
198 862 370
750 262 390
250 388
To find the sample average salary use the mean
function:
>
mean(CEOsalary)
[1]
412.9
To find the sample median salary use the mean
function:
>
median(CEOsalary)
[1]
346.5
To find the standard deviation of the CEO salaries
use the sd function:
>
sd(CEOsalary)
[1]
243.6439
To create a histogram of the CEO salaries use the hist function:
>
hist(CEOsalary)
How would you
describe the shape of the distribution? Notice the graph isn’t presentation
quality (not descriptively titled or labeled on the x-axis). To learn more
about the hist function use the help
command:
>
help(hist)
Now create a better
documented histogram:
>
hist(CEOsalary,
main="Annual Salaries for 30 CEOs of Small US Firms",xlab="Annual
Salary (in thousands of dollars)")
The arrow keys
on the keyboard can be used to recall
previous commands. The vertical arrow keys can be used to scroll forward
and backward through command history. (Try this.) Once a command is located in
this way, the cursor can be moved within the command using the horizontal arrow
keys, and characters can be removed with the delete key or added with the other
keys. This can sometimes save you the
time of repeatedly typing similar commands.
The first
number in a vector has index 1, the second has index 2, and so on. To pick one of
the CEO salaries, you can specify its index using square brackets:
>
CEOsalary[3]
[1]
350
Changing
topics, suppose we want to obtain the 97.5th percentile for a t-distribution with 25 degrees of
freedom. First, look at the R help for the t-distribution:
>
help(rt)
The help
indicates that we should use the quantile function:
>
qt(.975,25)
[1]
2.059539
As an
introductory example, we’ll consider a case where we do actually have theory to
guide us (then we can compare our bootstrap results with the theoretical
results). Suppose we take a random sample of size 100 from a normal population
with mean 50 and standard deviation 10. We are interested in the behavior of
the sample mean. Through theory, we know the sample mean will follow a normal
distribution with mean 50 and standard error
. We’ll also use bootstrap methods to
investigate the behavior of the sample mean.
First, create
your sample of size 100:
>
x <- rnorm(100, mean=50, sd=10)
This is your
original sample of data. (You can create a histogram of the sample, using hist(x), to see if it’s a good representation
of the population).
Now you will
write a function to generate B
bootstrap samples, calculating and saving the mean of each bootstrap sample.
First investigate the sample
function in R (we’ll need this as part of our bootstrap program):
>
help(sample)
Now name your
function:
> bootstrapmean <-
function(){}
Then edit your
function through the editor (if you type
in your function directly in the prompt menu, you are not allowed to correct
mistakes on previous lines, and you can easily lose work):
> bootstrapmean <-
edit(bootstrapmean)
Now type in the
following function, exactly as shown below. (You need not type in my comments,
which follow the pound signs; I included them so you’d know what the program is
doing.) Ask questions if you don’t understand what the program is doing.
function(B){ # B is the number of bootstrap samples
bootmeans <- NULL # initializing the vector “bootmeans”
for
(i in 1:B){ # this loop does the same
thing for each of the B bootstrap samples
bootsamp <-
sample(x, 100, replace=TRUE) # sample 100 values with replacement from the vector x
bootmeans[i] <- mean(bootsamp) # calculate the mean of the
bootstrap sample and store it in bootmeans
}
bootmeans # return the vector of
bootstrapped means
}
When you are
finished, simply exit out of the editor. The program will then ask if you want
to save changes; say “yes” to this. Important
note: if there is a syntax error in your program, R will let you know; in
order to get back to edit your program, you must
type bootstrapmean <- edit() at the prompt (otherwise you may lose your work).
Notice this
function is specific to sampling 100 values from your specific x vector (i.e., x and 100 are “hard coded” into
the function). You could also write a function that takes more things as
arguments (number of bootstrap samples, vector of values, and sample size for
each bootstrap sample). That is, you can be creative and efficient with your
programming; in fact, I encourage you to “play around” with R. Note you have
some computer-science gurus in the class. Raj, Yannan,
and Peter are your go-to guys.
Now run the
function using B = 1000 (i.e., 1000 bootstrap samples—this is the
number suggested for confidence intervals). Be
sure to save the results somewhere, otherwise R will simply output the results
and you won’t have the ability to manipulate them:
> bootmeans1000 <- bootstrapmean(1000)
Create a
histogram of the 1000 bootstrapped means (hist(bootmeans1000)). Does the distribution look normal,
centered around 50?
Based on theory, we can create a 95% confidence interval for the population mean:
![]()
For my original
sample, the sample mean is
>
mean(x)
[1]
50.78915
#
Note that your mean will be different, as you have a different sample
Hence, my
confidence interval is about 48.83 to 52.75:
>
mean(x)-1.96
[1]
48.82915
>
mean(x)+1.96
[1]
52.74915
You can use the percentiles of your
bootstrapped means to create a confidence interval for the population mean. That is, the lower endpoint of the
confidence interval will be the 2.5th percentile of the bootstrapped
means and the upper endpoint will be the 97.5th percentile of the
bootstrapped means. First investigate the quantile
function in R:
>
help(quantile)
Then determine
your specific bootstrap confidence interval for the mean salary. (The c
function in R simply combines values into a vector. The second argument in the quantile function is the percentiles we want. Hence, we can
combine those, within the quantile function, using
the c function.) In my case:
>
quantile(bootmeans1000, c(0.025,.975))
2.5%
97.5%
48.95161
52.80932
Note that the bootstrap confidence
interval is very close to the theoretical confidence interval.
Bootstrap Confidence Interval for the
Median
Now reconsider
the CEO salaries. Recall the histogram of salaries was skewed toward positive
values. Hence, the mean is perhaps not the best measure of center or typical
value. Instead, we should use the median.
Suppose we want
to estimate the median annual salary of all
CEOs of small firms in the
Write a program
to create B bootstrapped medians.
First name your function:
> bootstrapmedian <-
function(){}
Then edit your
function through the editor:
> bootstrapmedian <-
edit(bootstrapmedian)
Now type in the
following function (or create your own function that doesn’t “hard code” the CEOsalary vector and sample size of 30):
function(B){
bootmedians <- NULL
for
(i in 1:B){
bootsamp <-
sample(CEOsalary, 30, replace=TRUE)
bootmedians[i] <- median(bootsamp)
}
bootmedians
}
Run the program
for B = 1000 (and save the results to a variable, say “bootmedians1000”).
Look at a histogram of the estimates. Furthermore, determine the standard
deviation of the estimates (our estimate of the standard error of the median)
and create a 95% bootstrap confidence interval for the population median. (Note: look back in the handout if you’ve
forgotten how to run a program, create a histogram, calculate a standard
deviation, or create a bootstrap confidence interval).
Final Notes
·
In general, R is space-insensitive.
>
3 +4
>
3+ 4
>
mean(3+5)
>
mean ( 3 + 5 )
BUT,
the assignment operator must not have spaces (<- is different
from < -).
·
R has extensive help documentation. From the
Help menu you can view the FAQ (Frequently Asked Questions) on R. Furthermore,
you can access the help manuals, including “An Introduction to R,” which is an
excellent introductory guide.
·
You can copy a graph from R and paste it into
Word (for assignments or reports).
·
To
see the list of all objects in your workspace:
>
ls()
[1]
"bootmeans1000"
"bootmedians1000" "bootstrapmean" "bootstrapmedian"
[5]
"CEOsalary" "x"
·
Note
that the vector x
is still included in your workspace. You no longer need these data, so it is a
good idea to remove the vector (keeping your workspace as clean as possible):
>
rm(x)
>
ls()
[1]
"bootmeans1000"
"bootmedians1000" "bootstrapmean" "bootstrapmedian"
[5]
"CEOsalary"
·
To quit, type
>
q()
You will be given the option to save the workspace image. You don’t necessarily need to save the work from today’s lab, but you definitely want to save your work for homework assignments (so you can go back to the data and functions if need be). Note: If you save your workspace to a particular directory, then you can access it the next time you use R through the File>Change dir option.