Double-click on
the My Computer icon on the desktop. Then double-click on the campus_share on 'curtis' (U:) drive and then the Class_Share
folder. Finally, double-click on the Math
folder and then the math_445 folder. What you see in this folder is the
Minitab file we will use in today’s lab: BodyDimensions.MPJ. (If you still have
this from last lab, then you can ignore the next paragraph and simply double-click
on the file in your account.)
As a class, we
cannot all work from this shared file (only one person can access it at a time).
Thus, you each need to copy the file to your personal account. You can do this
by highlighting the file and then pressing Ctrl-C to copy it. Now open
the My Documents folder on the desktop (this is the My Documents
folder of your personal account). Once you are in the My Documents
folder, hit Ctrl-V to paste the file into your account. Now open the Minitab
software (from the Start menu select Programs>Class Programs
and then Minitab>Minitab15). Then open the file (BodyDimensions.MPJ)
in Minitab: go to the File menu and
choose Open Project. (You can also
double-click on the file within your account; this opens the file and Minitab
simultaneously.)
In
a 2003 study, body girth (circumference) measurements (in cm) and skeletal
diameter measurements (in cm), as well as age (in years), weight (in kg),
height (in cm), and sex were measured on 507 physically active individuals (247
men and 260 women).
This data set
is very rich, in that there are many variables and many questions to
investigate. You can, for example, 1) look at single variables (graphically and
numerically), 2) compare the distributions of a variable by sex, or 3)
use a scatterplot to consider the relationship
between two variables. You may also want to analyze the data separately for men and women. To do so, from the Data
menu select Split Worksheet. In the dialog box, select the Sex
variable as the “By variable”. Minitab will then create two new worksheets
separating the data by sex (note that it will also keep the original worksheet
intact). The highlighted worksheet will be the active worksheet and it’s the
active worksheet that Minitab will work with (be sure to label graphs
appropriately).
You’ll have
time in lab to investigate this data set on your own (asking me any questions
you have about Minitab and/or data analysis). Remember your mind must stay
active while using the statistical software. Think about how the data are set
up, what research questions might be interesting, and how to best answer those
research questions. At the end of your
investigation time, I’ll ask each of you to share an interesting result you
found and/or something you thought might be interesting, but didn’t pan out.
Investigation of “Large” Sample Size in Central-Limit-Theorem Applications
First we’ll consider
sampling from the uniform (0, 1) distribution. Although you’re already
familiar with this distribution, Minitab can easily graph the pdf: Graph>Probability
Distribution Plot>View Single; then choose the uniform distribution with
lower endpoint 0 and upper endpoint 1. We can have Minitab randomly generate
values from this distribution. Open a new worksheet (from the File menu select New>Minitab Worksheet). Label the first column “Uniform-Distribution
Values.” Then go to the Calc menu and
select Random Data>Uniform.
Generate 5000 rows of data and store them in the Uniform-Distribution Values column. Graphing the values in this
column gives us an estimate of what the uniform distribution looks like (that
is, the sample-data distribution should
look like the population distribution). Create a histogram of the Uniform-Distribution Values variable.
(From the Graph menu select Histogram>Simple. Choose Uniform-Distribution Values as your
“Graph variable.”) What does it look like?
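For comparison outside Minitab, here is a minimal Python sketch of the same step: draw 5000 values from the uniform (0, 1) distribution and count how many fall in each of 10 equal-width bins. The function name, the seed, and the choice of 10 bins are assumptions made for illustration; they are not part of the lab.

```python
import random

def uniform_histogram_counts(n_values=5000, n_bins=10, seed=445):
    """Draw n_values from Uniform(0, 1) and count how many land in each equal-width bin."""
    rng = random.Random(seed)  # the seed is arbitrary; it only makes the run repeatable
    values = [rng.random() for _ in range(n_values)]
    counts = [0] * n_bins
    for v in values:
        # int(v * n_bins) is the bin index; min() guards the top edge
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return values, counts

values, counts = uniform_histogram_counts()
for i, c in enumerate(counts):
    # One star per 25 observations gives a rough text histogram
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): {'*' * (c // 25)} {c}")
```

If the sample-data distribution resembles the population distribution, the bars should all be of roughly equal length.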
We will
consider the uniform distribution our population, and we’ll simulate repeated
sampling from this population. Go back to the Calc menu and select Random
Data>Uniform. Again, generate 5000 rows, but now store them in C2-C11.
Now let’s think carefully about the data we have. We have 10 columns, each of
which contains 5000 random draws from the uniform distribution. But we can also
think about these data across rows. That is, we can think of the first row (of
columns C2-C11) as a random sample of 10 draws from the uniform distribution.
Then we have 5000 samples of size 10 (since we have 5000 rows of data).
For each sample
of 10, we can calculate the sample mean. This is done by going to the Calc menu and selecting Row Statistics. Select the mean as the
statistic. Then highlight variables C2-C11 in the left-hand column and select
them to be the “Input variables.” In the “Store result in” box, type “Sample
Means” (Minitab will then label the next open column, C12, “Sample Means,” and
store the results in it). In your worksheet, scroll over to column 12. Each
value in this column is a sample mean based on a sample of size 10 from the
uniform distribution. Hence, a graph of these values will be an estimate of the
sampling distribution of the sample mean. (Recall that this sampling
distribution is the distribution of values the sample mean takes in all
possible samples of size 10 from the uniform distribution.)
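The row-means idea can also be sketched in Python. This version builds each row of 10 draws directly instead of generating 10 columns and reading across them, which should be equivalent for our purposes; the function name and seed are assumptions made for illustration.

```python
import random
import statistics

def simulate_sample_means(n_samples=5000, sample_size=10, seed=445):
    """Return one sample mean per simulated sample drawn from Uniform(0, 1)."""
    rng = random.Random(seed)  # arbitrary seed, for repeatability only
    means = []
    for _ in range(n_samples):
        # One row of the worksheet: sample_size draws (the C2-C11 entries)
        row = [rng.random() for _ in range(sample_size)]
        means.append(statistics.fmean(row))  # the Row Statistics mean for that row
    return means

sample_means = simulate_sample_means()
print("number of sample means:", len(sample_means))
print("mean of sample means:  ", round(statistics.fmean(sample_means), 4))
print("sd of sample means:    ", round(statistics.stdev(sample_means), 4))
```

The list sample_means plays the role of the Sample Means column: each entry is the average of one sample of size 10.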
Create a
histogram of the Sample Means
variable. (From the Graph menu select
Histogram>Simple. Choose Sample Means as your “Graph variable.”
You can title the graph by clicking on the Labels
button.) What is the shape of this distribution? Recall you can fit a
normal-distribution curve to the histogram by right-clicking on the graph and
then selecting Add>Distribution
Fit>Normal. Furthermore, you can more carefully assess the normality of
the means by looking at a Normal-Probability
Plot. Here’s the gist of a normality plot:
· Arrange the sample-data values from smallest to largest, and record what percentile of the data each value occupies. (For example, the smallest observation in a set of 20 values is at the 1/20 = 0.05, or 5th, percentile.)
· Do normal-distribution calculations to find the z-scores (via reverse lookup) at these sample percentiles. (For example, z = -1.645 is the 5th percentile of the standard normal curve.)
· Plot each data point x against the corresponding z. If the sample-data distribution is close to standard normal, the plotted points will lie close to the line x = z. If the data distribution is close to any normal distribution, the plotted points will lie close to some straight line. Deviations from a straight line indicate that the sample distribution deviates from normal (e.g., heavier tails, skewness).
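The three steps can be sketched in Python as follows. One deliberate departure from the description: the sketch uses the percentile (i - 0.5)/n rather than i/n, because i/n would place the largest value at the 100th percentile, where the z-score is infinite. That adjustment, the function names, and the two made-up test data sets are assumptions for illustration; this is not Minitab's method.

```python
import random
from statistics import NormalDist, fmean

def normal_plot_points(data):
    """Pair each sorted data value x with the z-score at its sample percentile."""
    xs = sorted(data)                      # step 1: smallest to largest
    n = len(xs)
    std_normal = NormalDist()              # standard normal: mean 0, sd 1
    # Step 2: reverse lookup of z at percentile (i - 0.5) / n (an assumed adjustment)
    zs = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(zs, xs))               # step 3: the (z, x) points to plot

def straightness(points):
    """Correlation between z and x; values near 1 mean the points are nearly on a line."""
    zs, xs = zip(*points)
    z_bar, x_bar = fmean(zs), fmean(xs)
    s_zx = sum((z - z_bar) * (x - x_bar) for z, x in points)
    s_zz = sum((z - z_bar) ** 2 for z in zs)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_zx / (s_zz * s_xx) ** 0.5

rng = random.Random(445)  # arbitrary seed
normal_data = [rng.gauss(70, 3) for _ in range(200)]        # made-up, roughly normal
skewed_data = [rng.expovariate(1 / 5) for _ in range(200)]  # made-up, right-skewed
print("normal data:", round(straightness(normal_plot_points(normal_data)), 3))
print("skewed data:", round(straightness(normal_plot_points(skewed_data)), 3))
```

The correlation is only a rough numerical stand-in for eyeballing the plot; the skewed data should score noticeably lower than the normal data.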
The
normal-probability plots produced by Minitab use more sophisticated versions of
the basic idea above, but it’s only important that you understand the basic
idea. (If you want to read more about
normal-probability plots, see Section 4.6 of the textbook.) To create a
normal-probability plot of the Sample
Means variable, select Graph>Probability
Plot>Single and choose Sample
Means as your graph variable (by default, Minitab will create a probability
plot based on the normal distribution, but you can change this using the Distributions button). What do you think? It seems that n = 10 is a “large” enough
sample size when sampling from a uniform distribution (that is, the sample
means based on samples of size 10 closely follow a normal curve).
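To know which normal curve the sample means should follow, we can work out the mean and standard deviation of the sampling distribution from standard facts about the uniform (0, 1) distribution. These values are not stated in the lab itself; they are a worked check for n = 10.

```latex
\[
\begin{aligned}
\mu &= \frac{0 + 1}{2} = 0.5, &\qquad \sigma^2 &= \frac{(1 - 0)^2}{12} = \frac{1}{12}, \\
\mu_{\bar{X}} &= \mu = 0.5, &\qquad \sigma_{\bar{X}} &= \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{1}{12 \cdot 10}} \approx 0.0913 .
\end{aligned}
\]
```

So the histogram of the 5000 sample means should be centered near 0.5, with a standard deviation near 0.09.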
Now we’ll
consider a non-symmetric population distribution. Recall the exponential
distribution is positively skewed. Look at a graph of an exponential
distribution with mean 5 (recall, this is the distribution of waiting times for
the first occurrence of something, when we expect on average 1/5 of an occurrence within a
unit interval, that is, one occurrence every 5 units): Graph>Probability
Distribution Plot>View Single; then choose the exponential distribution
with scale (read: mean) 5 and threshold 0 (something we’ve not discussed: a
non-zero threshold simply shifts the distribution along the x-axis). Clearly, this distribution is heavily skewed toward the
larger values.
Open a new
worksheet and label the first column “Exponential-Distribution Values.” Then randomly
generate 5000 values from the exponential (mean=5) distribution and store them
in this column (Calc>Random
Data>Exponential). Then create a histogram of that sample of 5000
values. As we expect, the sample-data distribution is strongly skewed to the
higher values.
Repeat the
process you used for the uniform-distribution sampling. That is, create 10
columns of 5000 randomly-generated exponential (mean=5) values. Then determine
and store the row means (recall these are 5000 sample averages, each based on a
sample of size 10). Investigate the normality of these sample averages using
both a histogram (with normal-curve fit) and a normal probability plot. What do
you think? It’s clear that n=10 is
not a “large” enough sample in this case (that is, when sampling from an exponential distribution, the sample averages
based on samples of size 10 do not follow a normal curve).
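One rough numerical way to see the same thing, sketched in Python: simulate the 5000 sample means and compare how often they fall more than 2 standard deviations below versus above the mean. For a normal curve both proportions would be about 0.023. The function names and seed are assumptions for illustration, and this tail comparison is a substitute diagnostic, not the histogram or probability plot the lab asks for.

```python
import random
from statistics import fmean

def exponential_sample_means(sample_size, n_samples=5000, mean=5.0, seed=445):
    """Return n_samples sample means, each from sample_size exponential draws."""
    rng = random.Random(seed)  # arbitrary seed, for repeatability only
    rate = 1.0 / mean          # random.expovariate takes the rate (1 / mean), not the mean
    return [
        fmean([rng.expovariate(rate) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

def tail_proportions(means, sample_size, mean=5.0):
    """Proportion of sample means more than 2 sd below and above the population mean."""
    sd_of_mean = mean / sample_size ** 0.5  # for an exponential, sd equals the mean
    lower_cut = mean - 2 * sd_of_mean
    upper_cut = mean + 2 * sd_of_mean
    lower = sum(m < lower_cut for m in means) / len(means)
    upper = sum(m > upper_cut for m in means) / len(means)
    return lower, upper

means_10 = exponential_sample_means(sample_size=10)
lower, upper = tail_proportions(means_10, sample_size=10)
print(f"proportion below mean - 2 sd: {lower:.4f}")
print(f"proportion above mean + 2 sd: {upper:.4f}")
```

An upper-tail proportion well above the lower-tail proportion is the numerical counterpart of the right skew you see in the histogram.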
Individual Analysis
Now repeat the
process (using the exponential, mean=5, distribution) for n=30, n=40, and n=50. Open a new worksheet each time, so
you can look back at your work. What do you think is a good rule of thumb for
“large” sample size (when applying the CLT) when sampling from an exponential
distribution? (Feel free to fine-tune your rule of thumb; for example, you can
try n=48.) I realize Minitab is a bit
clunky for these kinds of simulations. Soon you will learn a new statistical
package, R, that makes repeated simulations easier.
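Until then, here is a minimal Python sketch of how such a repeated simulation might be looped over several sample sizes. It summarizes each set of 5000 sample means by its skewness (0 for a perfectly symmetric distribution), which is a numerical stand-in chosen here, not the histogram and normal-probability plot used in the lab. The function name and seed are assumptions for illustration.

```python
import random
from statistics import fmean

def sample_mean_skewness(sample_size, n_samples=5000, mean=5.0, seed=445):
    """Simulate sample means from an exponential distribution and return their skewness."""
    rng = random.Random(seed)  # arbitrary seed, for repeatability only
    rate = 1.0 / mean          # random.expovariate takes the rate (1 / mean)
    means = [
        fmean([rng.expovariate(rate) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]
    center = fmean(means)
    m2 = fmean([(m - center) ** 2 for m in means])  # second central moment
    m3 = fmean([(m - center) ** 3 for m in means])  # third central moment
    return m3 / m2 ** 1.5

skew_by_n = {n: sample_mean_skewness(n) for n in (10, 30, 40, 50)}
for n, skew in skew_by_n.items():
    print(f"n = {n:2d}: skewness of sample means = {skew:.3f}")
```

The skewness should shrink toward 0 as n grows; how small is small enough is the judgment call behind your rule of thumb.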