Double click on
the My Computer icon on the desktop. Then double click on the campus_share on 'curtis' (U:) drive and then the Class_Share
folder. Finally, double click on the Math
folder and then the Math_117 folder.
What you see in this folder are the SPSS files we will use in today’s lab: CountryData.sav,
DrinkingStudy.sav, FacultySalaries.sav, and Oil.sav.
As a class, we
cannot all work from these shared files (only one person can access them at a time).
Thus, you each need to copy the four files to your personal account. You can do
this by simply highlighting all the files (click on the first one, then
shift-click on the last one—this should highlight them all). Then press Ctrl-C
to copy the files. Now open the My Documents folder on the desktop (this
is the My Documents folder of your personal account). Once you are in
the My Documents folder, hit Ctrl-V to paste the four files into your
account.
Now open the SPSS
software (from the Start menu select Programs>Class Programs
and then SPSS Inc>PASW Statistics 17). From the File menu
select Open>Data, then change the folder to My Documents and
open DrinkingStudy.sav.
In 1994 the
Harvard School of Public Health published a college alcohol study. Samples of
students from 140 four-year colleges were asked questions about their alcohol
consumption (demographic information was also collected). The 10,904 responses
are included in this data set. The students answered questions based on their
behavior in the last 30 days (e.g.,
they recorded how often they drove after drinking within the last 30 days).
Note the
spreadsheet has both a “Data View” and a “Variable View.” Click on the “Variable
View” folder tab to find out additional information about the variables (e.g., more detailed descriptions than
the limited 10-character variable names). We will use frequency tables, pie
charts, and bar charts to summarize the categorical variables. Suppose we
simply want to know the proportion of males and females in the sample (in
numbers, not a graph). To get a frequency
table go to the Analyze menu and
select Descriptive Statistics>Frequencies.
Move the Sex variable into the Variable(s) box. Then click on the OK button. The results are shown in the Output
window. Are there an equal number of males and females in the sample? How might
this affect the rest of our analysis?
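If it helps to see what the frequency table is computing, here is a minimal sketch in Python (not part of the SPSS workflow). The function name frequency_table and the five toy responses are made up for illustration; they are not taken from DrinkingStudy.sav.

```python
from collections import Counter


def frequency_table(values):
    """Return {category: (count, percent)} for a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}


# Hypothetical toy responses, NOT taken from DrinkingStudy.sav.
sex = ["Female", "Male", "Female", "Female", "Male"]
table = frequency_table(sex)

# Print one row per category: label, count, percent.
for category, (count, percent) in sorted(table.items()):
    print(f"{category:<8}{count:>6}{percent:>8.1f}%")
```

The percent column is simply each count divided by the total number of responses, which is the column to look at when asking whether the two groups are about the same size.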
To create a pie
chart go to the Graphs menu and
select Legacy Dialogs>Interactive>Pie>Simple (in general, it is always better to use the
interactive graphs—they are more presentation-quality and can be edited—on your
own time, you might also want to play around with the Chart Builder option).
A dialog box will appear. From the left column drag the DrinkDrive
variable (“Driven after drinking alcohol?”) into the Slice by: box. Note that the
default slice summary is the count, but you can change it to percent by simply
dragging the Percent variable from the left column and dropping it in
the Slice summary box (although a pie chart with slices representing
counts will look the same as a pie chart with slices representing percentages).
There are folder tabs at the top of the dialog box (currently, you are on the Assign
Variables tab). Click on the Pies tab and you can select how the
slices are labeled (it’s nice to have numerical summaries—counts and
percents—to accompany the graph). Click on the Titles tab and give the
graph an appropriate title (this is very important!). The pie chart will appear
in the Output window. If you double click on the pie chart you will invoke the graph editor. Note you can change
things like text, font, slice labels, etc.
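To see why a pie chart of counts looks the same as a pie chart of percents, note that each slice's angle depends only on its share of the total. The sketch below is illustrative Python, not SPSS; slice_angles and the counts are hypothetical.

```python
def slice_angles(summary):
    """Convert slice summaries (counts or percents) to angles in degrees."""
    total = sum(summary.values())
    return {label: 360.0 * value / total for label, value in summary.items()}


# Hypothetical counts for illustration, not taken from the data file.
counts = {"Yes": 30, "No": 90}
n = sum(counts.values())
percents = {label: 100.0 * c / n for label, c in counts.items()}

angles_from_counts = slice_angles(counts)
angles_from_percents = slice_angles(percents)
print(angles_from_counts)
print(angles_from_percents)
```

Both calls give the same angles, so the choice of slice summary only changes the labels on the chart, not its shape.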
To create a bar
chart, from the Graphs menu
select Legacy Dialogs>Interactive>Bar. In the dialog
box drag the DrinkRide variable (“Rode with a
high/drunk driver?”) to the box on the horizontal axis. Note the default is for
the bar heights to be the counts, but you can have the bar heights be percents
by dragging the Percent variable to the box on the vertical axis. From the Bar Chart
Options tab you can change the bar labels. Be sure to title your graph (via
the Titles folder tab).
Next week we’ll
look at the relationships between variables in this data set. (For example, do
males and females have different drinking habits?)
A faculty
salary study was done at The Ohio State University to compare faculty salaries
with those at other universities. Data were collected from the Association of
American Universities. The overall average salary (in thousands of dollars) for
OSU was obtained by computing the weighted average of salaries at each faculty
level with the weights being the proportion of faculty members in each rank.
The overall average salary for each of the other 49 universities was created
using the same weights OSU used.
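The weighted average described above can be sketched as follows. This is illustrative Python only; the rank proportions and salaries are hypothetical numbers, not OSU's actual figures.

```python
def weighted_average(values, weights):
    """Weighted mean; the weights are proportions and should sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(values[rank] * weights[rank] for rank in values)


# Hypothetical numbers for illustration only (salaries in thousands of dollars).
rank_proportions = {"full": 0.5, "associate": 0.3, "assistant": 0.2}
rank_salaries = {"full": 80.0, "associate": 60.0, "assistant": 50.0}

overall = weighted_average(rank_salaries, rank_proportions)
print(f"Overall average salary: {overall:.1f} thousand")
```

Applying the same weights to every university means that differences in the overall averages reflect differences in pay at each rank, not differences in the mix of ranks.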
The dataset
contains information on whether a university belongs to the Committee on
Institutional Cooperation (a consortium of collaborating research universities),
an overall average salary (in thousands), and the average full, associate, and
assistant professor salaries (in thousands).
To make a histogram
of the Salary variable, select Legacy
Dialogs>Interactive>Histogram from the Graphs menu. In the dialog box, drag the Salary variable to
the box on the x-axis. Note that the default is a frequency histogram,
but you can create a percent histogram by dragging the Percent variable
to the box on the y-axis. Click on the Histogram tab and notice you
could change the intervals or overlay a normal curve. Be sure to title your
graph (via the Titles folder tab).
How would you describe the shape of the average salary distribution? (In class,
we might go back to the histogram dialog box to change the number of intervals, in
order to smooth the distribution.)
Suppose you
would like separate salary histograms
for CIC and non-CIC institutions. Then in the histogram dialog box, drag the CICstatus (Belong
to the CIC) variable into the Panel Variables box. This will
create the two different histograms and place them on the same scale (note you
should choose percent, rather than frequency, histograms since the two groups
are of different sizes). What differences or similarities do you notice in the
distributions?
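The reason to prefer percent histograms when the groups differ in size can be seen in a small sketch. This is illustrative Python, not SPSS; percent_histogram, the interval settings, and both toy groups are hypothetical.

```python
def percent_histogram(values, start, width, n_bins):
    """Percent of values falling in each interval [start + i*width, start + (i+1)*width).

    Values outside the covered range are not counted in any interval.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return [100.0 * c / len(values) for c in counts]


# Hypothetical salaries (in thousands) for two groups of different sizes.
group_a = [52, 55, 58, 61, 63, 67]
group_b = [51, 62, 66]

pct_a = percent_histogram(group_a, start=50, width=10, n_bins=2)
pct_b = percent_histogram(group_b, start=50, width=10, n_bins=2)
print(pct_a)
print(pct_b)
```

With raw counts, the larger group would have taller bars in every interval simply because it has more members; converting to percents puts the two groups on a comparable scale.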
To get descriptive
statistics, select Descriptive Statistics>Frequencies from the Analyze menu (note you can also select Descriptive
Statistics>Descriptives, but then you don’t
have as many options for the descriptive statistics to be calculated). In the
dialog box, highlight the Salary variable and then hit the right arrow
button to move it to the variables box. Now click on the Statistics
button and select the statistics you want (choose the 5-number summary, mean,
and standard deviation). If you click on the Charts button you can also
have a histogram created. Finally, click on the box to the left of Display
frequency tables so that it is no longer checked. The Salary variable
is quantitative, not categorical, so it makes no sense to show the frequencies.
Do these descriptive statistics (shown in the Output window) corroborate your
description of the salary distribution based on the histogram?
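For reference, the statistics requested above can be sketched in Python using only the standard library. The salaries are hypothetical, and software packages use differing quartile conventions, so the quartiles from this sketch may not match the SPSS output exactly on real data.

```python
import statistics


def describe(values):
    """Five-number summary plus mean and sample standard deviation."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (divides by n - 1)
    }


# Hypothetical salaries in thousands of dollars, not from FacultySalaries.sav.
salaries = [48, 52, 55, 57, 60, 63, 71]
summary = describe(salaries)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

Comparing the mean with the median is one quick check on shape: a mean noticeably above the median suggests a right-skewed distribution.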
If you are
interested in a first exploration of a
variable or variables via graphs and descriptive statistics, then select Descriptive
Statistics>Explore from the Analyze
menu. Move the FullSal,
AssocSal,
and AsstSal
variables all to the Dependent List box. Click on the Statistics
box and select Percentiles as well as descriptives
(otherwise SPSS won’t output the quartiles). Click on the Plots button
and select both “Stem-and-Leaf” and “Histogram” (note that this is the only place
in SPSS where you can get a stem-and-leaf plot). Also, by default SPSS
creates boxplots
of these variables. Because of how these data are set up (three separate
variables), you should select Dependents
together for the boxplots. Through the Explore
option, a lot of information is shown in the Output window. The average salaries
for the different faculty ranks can be compared via numerical summaries and
graphs (note the histograms are not on the same scale, so you need to be
careful when making comparisons). Note there are outliers shown in the boxplots and these
observations are identified by their row numbers, so you can easily look up
which colleges they represent. How do the salary distributions compare between
faculty ranks? (Use the boxplots, histograms, and
numerical summaries to answer this question.)
If you want descriptive statistics for a quantitative
variable, but separate statistics based on a second, categorical variable,
then you must use the Descriptive Statistics>Explore option. For example, move the Salary
variable to the Dependent List (quantitative variable) and then move the
CICstatus to the Factor List (categorical
variable). Note from the Plots button you should select Factor levels
together for the boxplots.
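The idea of splitting a quantitative variable by the levels of a categorical variable can be sketched as below. This is illustrative Python, not what SPSS does internally; describe_by_group and the five toy rows are hypothetical.

```python
from collections import defaultdict
import statistics


def describe_by_group(rows):
    """rows: iterable of (group_label, value) pairs -> {group: (n, mean, sd)}."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    return {
        label: (len(vals), statistics.mean(vals), statistics.stdev(vals))
        for label, vals in groups.items()
    }


# Hypothetical (factor, dependent) pairs; salaries in thousands of dollars.
rows = [("CIC", 70), ("CIC", 74), ("CIC", 78), ("non-CIC", 60), ("non-CIC", 66)]
by_group = describe_by_group(rows)
for label, (n, mean, sd) in by_group.items():
    print(f"{label:<8} n={n}  mean={mean:.1f}  sd={sd:.2f}")
```

The categorical variable plays the role of the Factor List and the quantitative variable the role of the Dependent List.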
To create boxplots directly, select Legacy
Dialogs>Interactive>Boxplot from the Graphs menu. From this dialog box, you can create a single boxplot
(one quantitative variable on the vertical axis and no variables on the
horizontal axis) or comparative boxplots (one quantitative variable on the vertical
axis and a categorical variable on the horizontal axis). To create comparative boxplots of average salaries for full professors at CIC
institutions versus non-CIC institutions, drag the FullSal
variable (the quantitative variable) to the vertical axis and the CICstatus variable (the categorical variable) to the horizontal
axis. Just to the left of the “2-D Coordinate” button is a button that will
turn the boxplots horizontal (rather than vertical).
Click this button if you prefer horizontal boxplots. If
you click on the Boxes tab, you can change the display. Be sure to
title your graph (via the Titles
folder tab). Notice the boxplot for the average salary
of full professors at CIC institutions shows two flagged outliers. Sometimes
it’s useful to investigate the outliers,
and therefore it’s nice to know where those particular values are in the
spreadsheet. (Note: SPSS labels
points as either outliers or extremes. Outliers are found according to the rule we
discussed in class; extremes are found by multiplying the IQR by 3 and then
following the same process.) Double-click on the boxplot
graph (which invokes the editor). Now right-click on a suspected outlier (marked
with a circle), and choose Show Data
Labels from the drop-down menu. A new dialog box will appear. By default,
SPSS will show the CIC-status for the outliers (which is not meaningful). Move
the Case number up to be displayed
and get rid of the CIC-status. Now each
outlier is marked by its row in the spreadsheet, so it can easily be
investigated. (Note: You can also
mark the outliers by the name of the institution, but you must go back to the boxplot dialog box and drag the University variable to the Label
cases by box.)
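The outlier and extreme labels can be sketched as below. This is a rough illustration in Python under two assumptions: that the class rule is the common 1.5 x IQR rule (the handout does not state the multiplier), and that quartiles are computed by Python's default method, which may differ from the quartiles SPSS uses for boxplots. The function flag_points and the data are hypothetical.

```python
import statistics


def flag_points(values, inner=1.5, outer=3.0):
    """Return {case_number: 'outlier' or 'extreme'} using IQR fences.

    Case numbers are 1-based row positions, like spreadsheet rows.
    inner=1.5 is an ASSUMED multiplier for the class rule; outer=3.0 is from the handout.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    flags = {}
    for case, v in enumerate(values, start=1):
        distance = max(q1 - v, v - q3)  # distance outside the box (negative if inside)
        if distance > outer * iqr:
            flags[case] = "extreme"
        elif distance > inner * iqr:
            flags[case] = "outlier"
    return flags


# Hypothetical salaries in row order (thousands of dollars).
values = [55, 120, 52, 58, 80, 50, 62, 54, 60, 56, 57]
flags = flag_points(values)
print(flags)
```

Returning case numbers rather than the values themselves mirrors the Case number labels, so a flagged point can be looked up directly by its row.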
This data file
includes the annual world crude oil production in millions of barrels,
1880-1984.
Time is obviously
a potential factor related to oil production, so we should initially consider
time in our analysis (if, after the analysis, it appears that time has no
impact on oil production, then we can ignore it and use a simple one-variable
graph, like a histogram). To create a time series plot of the oil production
variable, select Legacy Dialogs>Interactive>Line from the Graphs
menu. In the dialog box, move Oil to the vertical axis and Year to the horizontal axis. Now go to
the Dots and Lines folder tab, and
check the box for Display: Dots (this
will show the data points as dots—otherwise SPSS draws a smooth curve that
doesn’t give the whole picture). Finally, title the graph through the Labels button. Does it appear that time
is a factor in oil production? (In class, if we have time, we’ll discuss how to
make the time series plot fill the entire space of the graph.)
Additional Notes