Math 445 Computer Lab – Basic Data Analysis and Investigation of the CLT

 

Getting the Needed Files

Double click on the My Computer icon on the desktop. Then double click on the campus_share on 'curtis' (U:)  drive and then the Class_Share folder. Finally, double click on the Math folder and then the math_445 folder. What you see in this folder is the Minitab file we will use in today’s lab: BodyDimensions.MPJ. (If you still have this from last lab, then you can ignore the next paragraph and simply double-click on the file in your account.)

 

As a class, we cannot all work from this shared file (only one person can access it at a time). Thus, each of you needs to copy the file to your personal account. You can do this by highlighting the file and pressing Ctrl-C to copy it. Now open the My Documents folder on the desktop (this is the My Documents folder of your personal account). Once you are in the My Documents folder, hit Ctrl-V to paste the file into your account. Now open the Minitab software (from the Start menu select Programs>Class Programs and then Minitab>Minitab15). Then open the file (BodyDimensions.MPJ) in Minitab: go to the File menu and choose Open Project. (You can also double-click on the file within your account; this opens the file and Minitab simultaneously.)

 

Description of BodyDimensions.MPJ

In a 2003 study, body girth (circumference) measurements (in cm) and skeletal diameter measurements (in cm), as well as age (in years), weight (in kg), height (in cm), and sex were measured on 507 physically active individuals (247 men and 260 women).

Individual Analysis

This data set is very rich, in that there are many variables and many questions to investigate. You can, for example, 1) look at single variables (graphically and numerically), 2) compare the distributions of a variable based on sex, or 3) use a scatterplot to consider the relationship between two variables. Furthermore, suppose you want to analyze the data separately for men and women. Then from the Data menu select Split Worksheet. In the dialog box, select the Sex variable as the “By variable.” Minitab will then create two new worksheets separating the data by sex (note that it will also keep the original worksheet intact). The highlighted worksheet is the active worksheet, and the active worksheet is the one Minitab works with (be sure to label graphs appropriately).

 

You’ll have time in lab to investigate this data set on your own (asking me any questions you have about Minitab and/or data analysis). Remember your mind must stay active while using the statistical software. Think about how the data are set up, what research questions might be interesting, and how to best answer those research questions. At the end of your investigation time, I’ll ask each of you to share an interesting result you found and/or something you thought might be interesting, but didn’t pan out.

 

Investigation of “Large” Sample Size in Central-Limit-Theorem Applications

First we’ll consider sampling from the uniform (0, 1) distribution. Although you’re already familiar with this distribution, Minitab can easily graph the pdf: Graph>Probability Distribution Plot>View Single; then choose the uniform distribution with lower endpoint 0 and upper endpoint 1. We can have Minitab randomly generate values from this distribution. Open a new worksheet (from the File menu select New>Minitab Worksheet). Label the first column “Uniform-Distribution Values.” Then go to the Calc menu and select Random Data>Uniform. Generate 5000 rows of data and store them in the Uniform-Distribution Values column. Graphing the values in this column gives us an estimate of what the uniform distribution looks like (that is, the sample-data distribution should look like the population distribution). Create a histogram of the Uniform-Distribution Values variable. (From the Graph menu select Histogram>Simple. Choose Uniform-Distribution Values as your “Graph variable.”) What does it look like?
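If you are curious how the same step looks outside Minitab, here is a minimal Python sketch (Python is not part of this lab; the seed, the variable names, and the choice of 10 bins are assumptions made only for illustration). It generates 5000 uniform (0, 1) values and prints a crude text histogram.

```python
import random

random.seed(445)  # arbitrary seed (an assumption) so the run is repeatable

# 5000 draws from the uniform (0, 1) distribution, like the worksheet column
uniform_values = [random.random() for _ in range(5000)]

# Count how many values fall in each of 10 equal-width bins:
# [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0)
bin_counts = [0] * 10
for value in uniform_values:
    bin_counts[int(value * 10)] += 1

# Crude text histogram: one '#' for every 25 values in the bin
for i, count in enumerate(bin_counts):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}) {'#' * (count // 25)} {count}")
```

Each bin should hold roughly 500 values, so the bars should come out at about the same length, which is the flat shape of the uniform pdf.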

 

We will consider the uniform distribution our population, and we’ll simulate repeated sampling from this population. Go back to the Calc menu and select Random Data>Uniform. Again, generate 5000 rows, but now store them in C2-C11. Now let’s think carefully about the data we have. We have 10 columns, each of which contains 5000 random draws from the uniform distribution. But we can also think about these data across rows. That is, we can think of the first row (of columns C2-C11) as a random sample of 10 draws from the uniform distribution. Then we have 5000 samples of size 10 (since we have 5000 rows of data).

 

For each sample of 10, we can calculate the sample mean. This is done by going to the Calc menu and selecting Row Statistics. Select the mean as the statistic. Then highlight variables C2-C11 in the left-hand column and select them to be the “Input variables.” In the “Store result in” box, type “Sample Means” (Minitab will then label the next open column, C12, “Sample Means,” and store the results in it). In your worksheet, scroll over to column 12. Each value in this column is a sample mean based on a sample of size 10 from the uniform distribution. Hence, a graph of these values will be an estimate of the sampling distribution of the sample mean. (Recall that this sampling distribution is the distribution of values the sample mean takes in all possible samples of size 10 from the uniform distribution).
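The row-means idea can also be sketched in a few lines of Python (again, only an illustration; the seed and the names `samples` and `sample_means` are assumptions, not anything Minitab uses). Each inner list plays the role of one worksheet row across C2-C11, and the list of row means plays the role of the Sample Means column.

```python
import random
import statistics

random.seed(445)  # arbitrary seed (an assumption) so the run is repeatable

N_SAMPLES = 5000   # number of rows in the worksheet
SAMPLE_SIZE = 10   # number of columns, C2-C11

# One inner list per row: a random sample of 10 draws from uniform (0, 1)
samples = [
    [random.random() for _ in range(SAMPLE_SIZE)]
    for _ in range(N_SAMPLES)
]

# One sample mean per row, like Calc > Row Statistics with "Mean" selected
sample_means = [statistics.fmean(row) for row in samples]

# The uniform (0, 1) population has mean 0.5 and variance 1/12, so the sample
# means should center near 0.5 with standard deviation near
# sqrt(1/12) / sqrt(10), which is about 0.0913.
print("mean of sample means:", statistics.fmean(sample_means))
print("sd of sample means:  ", statistics.stdev(sample_means))
```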

 

Create a histogram of the Sample Means variable. (From the Graph menu select Histogram>Simple. Choose Sample Means as your “Graph variable.” You can title the graph by clicking on the Labels button.) What is the shape of this distribution? Recall you can fit a normal-distribution curve to the histogram by right-clicking on the graph and then selecting Add>Distribution Fit>Normal. Furthermore, you can more carefully assess the normality of the means by looking at a Normal-Probability Plot. Here’s the gist of a normality plot:

·         Arrange the sample-data values from smallest to largest, and record what percentile of the data each value occupies. (For example, the smallest observation in a set of 20 values is at the 1/20 = 0.05, or 5th percentile.)

 

·         Do normal-distribution calculations to find the z-scores (via reverse lookup) at these sample percentiles. (For example, z = -1.645 is the 5th percentile of the standard normal curve.)

 

·         Plot each data point x against the corresponding z. If the sample-data distribution is close to standard normal, the plotted points will lie close to the line x = z. If the data distribution is close to any normal distribution, the plotted points will lie close to some straight line. Deviations from a straight line indicate that the sample distribution deviates from normal (e.g., heavier tails, skewness).
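The three steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not Minitab's method: the 20 data values are hypothetical simulated draws, and the percentile for the i-th smallest value is taken as (i - 0.5)/n rather than i/n, so that the largest value does not land at the 100th percentile (where no finite z-score exists). That adjustment is one common convention; Minitab's own choice may differ.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(445)  # arbitrary seed (an assumption) so the run is repeatable

# Hypothetical data: 20 draws from a normal distribution (mean 50, sd 4)
data = [random.gauss(50, 4) for _ in range(20)]

# Step 1: order the values and assign each a percentile.
# Assumption: (i - 0.5) / n is used instead of i / n (see the note above).
ordered = sorted(data)
n = len(ordered)
percentiles = [(i - 0.5) / n for i in range(1, n + 1)]

# Step 2: reverse lookup of the standard-normal z-score at each percentile
z_scores = [NormalDist().inv_cdf(p) for p in percentiles]

# Step 3: the (z, x) pairs are the points of the plot
for z, x in zip(z_scores, ordered):
    print(f"z = {z:6.3f}   x = {x:6.2f}")

# A correlation near 1 means the points lie close to a straight line
z_bar, x_bar = fmean(z_scores), fmean(ordered)
numerator = sum((z - z_bar) * (x - x_bar) for z, x in zip(z_scores, ordered))
denominator = math.sqrt(
    sum((z - z_bar) ** 2 for z in z_scores)
    * sum((x - x_bar) ** 2 for x in ordered)
)
r = numerator / denominator
print("correlation between z and x:", round(r, 3))
```

The correlation is just a numerical stand-in for judging straightness by eye; on the actual plot you should still look at where the points bend away from the line.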

 

The normal-probability plots produced by Minitab use more sophisticated versions of the basic idea above, but it’s only important that you understand the basic idea. (If you want to read more about normal-probability plots, see Section 4.6 of the textbook.) To create a normal-probability plot of the Sample Means variable, select Graph>Probability Plot>Single and choose Sample Means as your graph variable (by default, Minitab will create a probability plot based on the normal distribution, but you can change this using the Distributions button). What do you think? It seems that n = 10 is a “large” enough sample size when sampling from a uniform distribution (that is, the sample means based on samples of size 10 follow a normal curve).

 

Now we’ll consider a non-symmetric population distribution. Recall the exponential distribution is positively skewed. Look at a graph of an exponential distribution with mean 5 (recall, this is the distribution of waiting times for the first occurrence of something, when we expect one such occurrence every 5 units on average, that is, a rate of 1/5 occurrences per unit interval): Graph>Probability Distribution Plot>View Single; then choose the exponential distribution with scale (read: mean) 5 and threshold 0 (something we’ve not discussed: a non-zero threshold simply shifts the distribution along the x-axis). Clearly, this distribution is heavily skewed toward the positive values.
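For reference, the exponential distribution with mean 5 and threshold 0 has pdf f(x) = (1/5)e^(-x/5) for x ≥ 0. A small Python sketch (function and variable names are assumptions for illustration) evaluates the pdf and cdf at a few points and compares the median with the mean.

```python
import math

SCALE = 5.0  # Minitab's "scale" for the exponential, which equals the mean


def exponential_pdf(x, scale=SCALE):
    """Density of the exponential distribution with the given mean, threshold 0."""
    return math.exp(-x / scale) / scale if x >= 0 else 0.0


def exponential_cdf(x, scale=SCALE):
    """Probability that a draw is at most x."""
    return 1.0 - math.exp(-x / scale) if x >= 0 else 0.0


# The median solves cdf(x) = 0.5, which gives scale * ln(2)
median = SCALE * math.log(2)

for x in (0, 5, 10, 20):
    print(f"x = {x:2d}   pdf = {exponential_pdf(x):.4f}   cdf = {exponential_cdf(x):.4f}")
print(f"median = {median:.3f}, mean = {SCALE}")
```

The median falls below the mean, which is what we expect from a distribution with a long right tail.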

 

Open a new worksheet and label the first column “Exponential-Distribution Values.” Then randomly generate 5000 values from the exponential (mean=5) distribution and store them in this column (Calc>Random Data>Exponential). Then create a histogram of that sample of 5000 values. As we expect, the sample-data distribution is strongly skewed to the higher values.

 

Repeat the process you used for the uniform-distribution sampling. That is, create 10 columns of 5000 randomly-generated exponential (mean=5) values. Then determine and store the row means (recall these are 5000 sample averages, each based on a sample of size 10). Investigate the normality of these sample averages using both a histogram (with normal-curve fit) and a normal probability plot. What do you think? It’s clear that n=10 is not a “large” enough sample in this case (that is, when sampling from an exponential distribution, the sample averages based on samples of size 10 do not follow a normal curve).
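Here is a hedged Python sketch of the same experiment (names and seed are assumptions). As a numerical companion to the plots, it computes the sample skewness of the 5000 sample means; a normal curve has skewness 0. The skewness summary is an addition for illustration and is not something the Minitab steps above produce.

```python
import random
import statistics

random.seed(445)  # arbitrary seed (an assumption) so the run is repeatable

N_SAMPLES = 5000
SAMPLE_SIZE = 10
MEAN = 5.0


def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric shape)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


# random.expovariate takes the rate, which is 1 divided by the desired mean
sample_means = [
    statistics.fmean(random.expovariate(1 / MEAN) for _ in range(SAMPLE_SIZE))
    for _ in range(N_SAMPLES)
]

print("mean of sample means:    ", round(statistics.fmean(sample_means), 3))
print("skewness of sample means:", round(skewness(sample_means), 3))
```

In theory, the mean of n independent exponential draws has skewness 2/sqrt(n), which is about 0.63 for n = 10, so the simulated value should land well above 0.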

 

Individual Analysis

Now repeat the process (using the exponential, mean=5, distribution) for n=30, n=40, and n=50. Open a new worksheet each time, so you can look back at your work. What do you think is a good rule of thumb for “large” sample size (when applying the CLT) when sampling from an exponential distribution? (Feel free to fine-tune your rule of thumb; for example, you can try n=48.) I realize Minitab is a bit clunky for these kinds of simulations. Soon you will learn a new statistical package, R, that makes repeated simulations easier.
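To give a sense of how a scripted simulation handles the repetition, here is a minimal Python sketch (not the R approach you will learn later; all names and the seed are assumptions). It reruns the exponential experiment for several sample sizes and prints the skewness of the sample means next to the theoretical value 2/sqrt(n). It does not pick a rule of thumb for you; that judgment should still come from your histograms and probability plots.

```python
import math
import random
import statistics

random.seed(445)  # arbitrary seed (an assumption) so the run is repeatable

N_SAMPLES = 5000
MEAN = 5.0


def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric shape)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


def simulate_sample_means(sample_size):
    """Return N_SAMPLES sample means, each from sample_size exponential draws."""
    return [
        statistics.fmean(random.expovariate(1 / MEAN) for _ in range(sample_size))
        for _ in range(N_SAMPLES)
    ]


results = {}
for n in (10, 30, 40, 50):
    results[n] = skewness(simulate_sample_means(n))
    print(f"n = {n:2d}   simulated skewness = {results[n]:.3f}   "
          f"theory 2/sqrt(n) = {2 / math.sqrt(n):.3f}")
```

To try another sample size, such as n=48, add it to the tuple in the `for` loop.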