Double click on
the My Computer icon on the desktop. Then double click on the campus_share on 'curtis' (U:) drive and then the Class_Share
folder. Finally, double click on the Math
folder and then the math_117 folder.
What you see in this folder are the SPSS files we’ll use in today’s lab: CountryData.sav,
DrinkingStudy.sav, FacultySalaries.sav, and Oil.sav.
As a class, we
cannot all open these shared files directly (only one person can access them at a time).
Thus, you each need to copy the four files to your personal account. You can do
this by simply highlighting all the files (click on the first one, then
shift-click on the last one—this should highlight them all). Then press Ctrl-C
to copy the files. Now open the My Documents folder on the desktop (this
is the My Documents folder of your personal account). Once you are in
the My Documents folder, hit Ctrl-V to paste the four files into your
account.
Now open the SPSS
software (from the Start menu select Programs>Class Programs
and then SPSS Inc>PASW Statistics 17). From the File menu
select Open>Data, then change the folder to My Documents and
open DrinkingStudy.sav.
In 1994 the
Harvard School of Public Health published a college alcohol study. Samples of
students from 140 four-year colleges were asked questions about their alcohol
consumption (demographic information was also collected). The 10,904 responses
are included in this data set. The students answered questions based on their
behavior in the last 30 days (e.g.,
they recorded how often they drove after drinking within the last 30 days).
Note the spreadsheet has both a “Data
View” and a “Variable View.”
Click on the “Variable View” folder tab to find additional information about
the variables (e.g., more detailed
descriptions than the limited 10-character variable names). We will use frequency
tables, pie charts, and bar charts to summarize the categorical variables. Suppose
we simply want to know the proportion of males and females in the sample (in
numbers, not a graph). To get a frequency
table go to the Analyze menu and
select Descriptive Statistics>Frequencies.
Move the Sex variable into the Variable(s) box. Then click on the OK button. The results are shown in the Output
window. Are there an equal number of males and females in the sample? How might
this affect the rest of our analysis?
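For anyone curious about what the frequency table is computing, here is a minimal Python sketch of the same idea: count each category and convert the counts to percents. The short list of responses is made up for illustration (it is not taken from DrinkingStudy.sav), and `frequency_table` is a hypothetical helper name, not an SPSS function.

```python
from collections import Counter

# Made-up responses for illustration only; NOT values from DrinkingStudy.sav.
sex = ["Female", "Male", "Female", "Female", "Male", "Female", "Male", "Female"]


def frequency_table(values):
    """Return (category, count, percent) rows, sorted by category name."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(category, n, 100 * n / total) for category, n in sorted(counts.items())]


table = frequency_table(sex)
for category, count, percent in table:
    # One row per category: name, count, and percent of all responses.
    print(f"{category:<8}{count:>6}{percent:>8.1f}%")
```

The percents always sum to 100, which is why a pie chart of counts and a pie chart of percents look identical.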
To create a pie
chart go to the Graphs menu and select Legacy Dialogs>Interactive>Pie>Simple (in general, it
is better to use the interactive graphs, since they are closer to
presentation quality and can be edited; on your own time, you might also want to
experiment with the Chart Builder option). A dialog box will appear. From
the left column drag the DrinkDrive variable (“Driven after drinking alcohol?”) into the Slice by: box. Note that the default slice summary is the
count, but you can change it to percent by simply dragging the Percent variable from the left column
and dropping it in the Slice summary
box (although a pie chart with slices representing counts will look the same as a
pie chart with slices representing percentages). There are folder tabs at the top
of the dialog box (currently, you are on the Assign Variables tab). Click on the Pies tab and you can select how the slices are labeled (it’s
nice to have numerical summaries—counts and percents—to accompany the graph).
Click on the Titles tab and give
the graph an appropriate title (this is
very important!). The pie chart will appear in the Output window. What does
this graph tell us (in one sentence)? If you double click on the pie chart you
will invoke the graph editor. Note
you can change things like text, font, slice labels, etc.
To create a bar
chart, from the Graphs menu select Legacy Dialogs>Interactive>Bar. In the dialog box
drag the DrinkRide
variable (“Rode with a high/drunk driver?”) to the box on the horizontal axis.
Note the default is for the bar heights to be the counts, but you can have the
bar heights be percents by dragging the Percent
variable to the box on the vertical axis.
From the Bar Chart Options tab
you can change the bar labels. Be sure to title your graph (via the Titles folder
tab).
Next week we’ll look at the
relationships between variables in this data set. (For example, do males and
females have different drinking habits?)
A faculty
salary study was done at The Ohio State University to compare faculty salaries
with those at other universities. Data were collected from the Association of
American Universities. The overall average salary (in thousands of dollars) for
OSU was obtained by computing the weighted average of salaries at each faculty
level with the weights being the proportion of faculty members in each rank. The
overall average salary for each of the other 49 universities was created using
the same weights OSU used.
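The weighted-average calculation described above can be sketched in a few lines of Python. All numbers here are hypothetical (they are not the OSU proportions or any salaries from FacultySalaries.sav), and `weighted_average` is just an illustrative helper name.

```python
# Hypothetical proportions of faculty in each rank at OSU (assumed values).
osu_rank_proportions = {"full": 0.5, "associate": 0.3, "assistant": 0.2}

# Hypothetical average salaries (in thousands) by rank at some other university.
other_univ_salaries = {"full": 80.0, "associate": 60.0, "assistant": 50.0}


def weighted_average(values, weights):
    """Weight each rank's average salary by the proportion of faculty in that rank."""
    total_weight = sum(weights.values())
    return sum(values[rank] * weights[rank] for rank in weights) / total_weight


overall = weighted_average(other_univ_salaries, osu_rank_proportions)
print(f"Overall average salary: {overall:.1f} thousand dollars")
```

Using OSU's weights for every university means the overall averages differ only because of salary differences, not because one school happens to have more full professors than another.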
The dataset
contains information on whether a university belongs to the Committee on
Institutional Cooperation (a consortium of collaborating research universities),
an overall average salary (in thousands), and the average full, associate, and
assistant professor salaries (in thousands). Note these are average salaries for each university, not salaries for
individual faculty members.
To make a histogram
of the Salary variable
select Legacy Dialogs>Interactive>Histogram
from the Graphs menu. In the dialog box, drag the Salary variable to the box on the x-axis. Note the
default is a frequency histogram, but you can create a percent histogram by
dragging the Percent variable
to the box on the y-axis. Click on the Histogram tab and notice you could change the intervals or
overlay a normal curve (these changes shouldn’t be implemented until we see the
initial graph). Be sure to title your graph (via the Titles folder tab). How
would you describe the shape of the average salary distribution? (In class, we
might go back to the histogram dialog box to change the number of intervals, in
order to smooth the distribution.)
Suppose you
would like salary histograms for the different
levels of a categorical variable (CIC and non-CIC institutions). Then in
the histogram dialog box, drag the CICstatus (Belong to the CIC) variable into the Panel Variables box. This will create the two different
histograms and place them on the same scale (note you should choose percent, rather than frequency, histograms since
the two groups are of different sizes). What differences or similarities do
you notice in the distributions?
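To see why percent histograms are the fairer comparison when the groups differ in size, here is a small Python sketch. The two lists of salaries and the interval edges are made up for illustration, and `percent_histogram` is a hypothetical helper, not anything SPSS provides.

```python
def percent_histogram(values, edges):
    """Return (counts, percents) for the intervals defined by edges.

    Each interval is [edges[i], edges[i+1]); the last interval also includes
    its right endpoint. All values are assumed to lie within the edges.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts, [100 * c / len(values) for c in counts]


# Made-up average salaries (in thousands); the groups have different sizes.
cic = [60, 62, 68, 71]
non_cic = [52, 55, 58, 61, 63, 66, 69, 72, 74, 78]
edges = [50, 60, 70, 80]

cic_counts, cic_percents = percent_histogram(cic, edges)
non_counts, non_percents = percent_histogram(non_cic, edges)
print("CIC counts:", cic_counts, "percents:", cic_percents)
print("Non-CIC counts:", non_counts, "percents:", non_percents)
```

In this made-up example the middle interval holds 3 CIC schools and 4 non-CIC schools, which look similar as counts, but that is 75% of one group and only 40% of the other.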
To get descriptive
statistics, select Descriptive
Statistics>Frequencies from the Analyze menu (note you can also
select Descriptive Statistics>Descriptives, but then you don’t have as many
options for the descriptive statistics to be calculated—for example, the
5-number summary is not provided). In the dialog box, highlight the Salary variable and then hit the
right arrow button to move it to the variables box. Now click on the Statistics button and select the
statistics you want (choose the 5-number summary, mean, and standard deviation).
If you click on the Charts
button you can also have a histogram created. Finally, click on the box to the
left of Display frequency tables
so that it is no longer checked. The Salary variable is quantitative, not
categorical, so it makes no sense to show the frequencies. Do these
descriptive statistics (shown in the Output window) corroborate your
description of the salary distribution based on the histogram?
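As a rough check on what these statistics mean, the sketch below computes a 5-number summary, mean, and standard deviation with Python's standard `statistics` module. The salaries are made up (not from FacultySalaries.sav), `five_number_summary` is a hypothetical helper, and quartile conventions differ between software packages, so the quartiles here may not match SPSS output exactly.

```python
import statistics

# Made-up average salaries (in thousands), for illustration only.
salaries = [52.0, 55.5, 58.0, 60.0, 61.5, 63.0, 66.0, 70.0, 84.0]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


summary = five_number_summary(salaries)
mean = statistics.mean(salaries)
sd = statistics.stdev(salaries)  # sample standard deviation (divides by n - 1)

print("5-number summary:", summary)
print(f"mean = {mean:.2f}, standard deviation = {sd:.2f}")
```

In this made-up example the mean is larger than the median because of the one large value, which is the kind of pattern to compare against the shape of the histogram.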
If you are
interested in a first exploration of a
variable or variables via graphs and descriptive statistics, then select Descriptive Statistics>Explore from
the Analyze menu. Move the FullSal, AssocSal, and AsstSal variables
all to the Dependent List box.
Click on the Statistics box and
select Percentiles as well as descriptives (otherwise
SPSS won’t output the quartiles). Click on the Plots button and select both “Stem-and-Leaf” and “Histogram”
(note that this is the only place in SPSS where you can get a stem-and-leaf
plot). Also, by default SPSS creates boxplots of these variables. Because of how these data are set up (three separate variables), you
should select Dependents together for the boxplots.
Through the Explore option, a lot of information is shown in the Output window.
The average salaries for the different faculty ranks can be compared via
numerical summaries and graphs (note the histograms are not on the same scale,
so you must be careful when making comparisons). There are outliers shown in the boxplots and these
observations are identified by their row numbers, so you can easily look up
which colleges they represent. How do the salary distributions compare between
faculty ranks? (Use the boxplots, histograms, and
numerical summaries to answer this question.)
If you want descriptive statistics for a quantitative
variable, but separate statistics based on a second, categorical variable,
then you must use the Descriptive
Statistics>Explore option. For example, move the Salary variable to
the Dependent List (quantitative variable) and then move the CICstatus to the Factor List (categorical variable).
Note from the Plots button you should select Factor levels together for the boxplots.
To create boxplots directly, select Legacy Dialogs>Interactive>Boxplot from the Graphs
menu. From this dialog box, you can create a single boxplot (one quantitative variable
on the vertical axis and no variables on the horizontal axis) or comparative boxplots
(one quantitative variable on the vertical axis and a categorical variable on
the horizontal axis). To create comparative boxplots
of average salaries for full professors at CIC institutions versus non-CIC
institutions, drag the FullSal variable (the quantitative variable) to the vertical axis
and the CICstatus
variable (the categorical variable) to the horizontal axis. Just to the left of
the “2-D Coordinate” button is a button that will turn the boxplots
horizontal (rather than vertical). Click this button if you prefer horizontal boxplots. If you click on the Boxes tab, you can change the display. Be sure to
title your graph (via the Titles folder tab). Notice the boxplot
for the average salary of full professors at CIC institutions shows two flagged
outliers. Sometimes it’s useful to investigate
the outliers, and therefore it’s nice to know where those particular values
are in the spreadsheet. (Note: SPSS
labels points as either outliers or extremes. Outliers are found according to the rule
we discussed in class; extremes are found by multiplying the IQR by 3 and then
following the same process.) Double-click on the boxplot
graph (which invokes the editor). Now right-click on a suspected outlier
(marked with a circle), and choose Show Data Labels from the drop-down menu. A new dialog box will appear. By default,
SPSS shows the CIC-status for the outliers (which is not meaningful). Move the Case
number up to be displayed and get rid of the CIC-status. Now each outlier is
marked by its row in the spreadsheet, so it can easily be investigated. (Note: You can also mark the outliers by
the name of the institution, but you must go back to the boxplot
dialog box and drag the University variable to the Label cases by box.)
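The outlier and extreme labels can be illustrated with a short Python sketch. It assumes the rule from class is the usual 1.5 × IQR rule (that multiplier is an assumption here; only the factor of 3 for extremes is stated above). The data are made up, `classify_outliers` is a hypothetical helper, and SPSS may compute the quartiles slightly differently, so borderline points could be flagged differently there.

```python
import statistics


def classify_outliers(values, outlier_mult=1.5, extreme_mult=3.0):
    """Return (outlier_cases, extreme_cases) as 1-based row numbers.

    outlier_mult=1.5 is an ASSUMED value for the class rule;
    extreme_mult=3.0 follows the note about multiplying the IQR by 3.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    inner_low, inner_high = q1 - outlier_mult * iqr, q3 + outlier_mult * iqr
    outer_low, outer_high = q1 - extreme_mult * iqr, q3 + extreme_mult * iqr

    outliers, extremes = [], []
    for case, v in enumerate(values, start=1):  # case = row number in the data
        if v < outer_low or v > outer_high:
            extremes.append(case)
        elif v < inner_low or v > inner_high:
            outliers.append(case)
    return outliers, extremes


# Made-up average salaries (in thousands), one per row of a hypothetical data file.
full_salaries = [60, 62, 63, 64, 65, 66, 67, 68, 70, 82, 95]

outliers, extremes = classify_outliers(full_salaries)
print("Outlier rows:", outliers)
print("Extreme rows:", extremes)
```

Reporting row numbers rather than values mirrors labeling the points by case number, so each flagged point can be looked up directly in the spreadsheet.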
This data file
includes the annual world crude oil production in millions of barrels,
1880-1984.
Time is obviously a potential factor
related to oil production, so we should initially consider time in our analysis
(if, after the analysis, it appears that time has no impact on oil production,
then we can ignore it and use a simple one-variable graph, like a histogram). To create a time series plot of
the oil production variable, select Legacy Dialogs>Interactive>Line from the Graphs menu. In the dialog box, move Oil to the vertical axis and Year to the horizontal axis. Now go
to the Dots and Lines folder tab, and check the box for Display: Dots (this
will show the data points as dots—otherwise SPSS draws a smooth curve that
doesn’t give the whole picture). Finally, title the graph through the Labels
button. Does it appear that time is a factor in oil production? (In class, if
we have time, we’ll discuss how to make the time series plot fill the entire
space of the graph—invoke the graph editor, then choose Edit>Select X Axis,
and change the maximum to 1984.)
Important Additional Notes