Math 117 Computer Lab – One-Variable Graphics and Numerical Summaries

 

Getting the Needed Files

Double click on the My Computer icon on the desktop. Then double click on the campus_share on 'curtis' (U:)  drive and then the Class_Share folder. Finally, double click on the Math folder and then the Math_117 folder. What you see in this folder are the SPSS files we will use in today’s lab: CountryData.sav, DrinkingStudy.sav, FacultySalaries.sav, and Oil.sav.

 

The class cannot all work from these shared files directly (only one person can access them at a time). Thus, you each need to copy the four files to your personal account. You can do this by simply highlighting all the files (click on the first one, then shift-click on the last one; this should highlight them all). Then press Ctrl-C to copy the files. Now open the My Documents folder on the desktop (this is the My Documents folder of your personal account). Once you are in the My Documents folder, press Ctrl-V to paste the four files into your account.

 

Now open the SPSS software (from the Start menu select Programs>Class Programs and then SPSS Inc>PASW Statistics 17). From the File menu select Open>Data, then change the folder to My Documents and open DrinkingStudy.sav.

 

Description of DrinkingStudy.sav

In 1994 the Harvard School of Public Health published a college alcohol study. Samples of students from 140 four-year colleges were asked questions about their alcohol consumption (demographic information was also collected). The 10,904 responses are included in this data set. The students answered questions based on their behavior in the last 30 days (e.g., they recorded how often they drove after drinking within the last 30 days).

 

Analysis

Note the spreadsheet has both a “Data View” and a “Variable View.” Click on the “Variable View” folder tab to find out additional information about the variables (e.g., more detailed descriptions than the limited 10-character variable names). We will use frequency tables, pie charts, and bar charts to summarize the categorical variables. Suppose we simply want to know the proportion of males and females in the sample (in numbers, not a graph). To get a frequency table, go to the Analyze menu and select Descriptive Statistics>Frequencies. Move the Sex variable into the Variable(s) box. Then click on the OK button. The results are shown in the Output window. Is there an equal number of males and females in the sample? How might this affect the rest of our analysis?
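If you would like to see what a frequency table computes, here is a minimal sketch in Python. It is not part of the SPSS workflow, and the list of responses is made up for illustration (it is not the real DrinkingStudy.sav data). The function name frequency_table is our own choice.

```python
from collections import Counter


def frequency_table(values):
    """Return (category, count, percent) rows, one per category."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, n, 100 * n / total) for cat, n in sorted(counts.items())]


# Hypothetical responses, NOT the real DrinkingStudy.sav data.
sex = ["Female", "Male", "Female", "Female", "Male",
       "Female", "Male", "Female", "Female", "Male"]

table = frequency_table(sex)
for category, count, percent in table:
    print(f"{category:<8}{count:>6}{percent:>8.1f}%")
```

The percent column is simply each count divided by the total number of responses, which is the kind of summary the Frequencies output reports for a categorical variable.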

 

To create a pie chart, go to the Graphs menu and select Legacy Dialogs>Interactive>Pie>Simple (in general, the interactive graphs are the better choice because they are closer to presentation quality and can be edited; on your own time, you might also want to play around with the Chart Builder option). A dialog box will appear. From the left column drag the DrinkDrive variable (“Driven after drinking alcohol?”) into the Slice by: box. Note that the default slice summary is the count, but you can change it to percent by simply dragging the Percent variable from the left column and dropping it in the Slice summary box (although a pie chart with slices representing counts will look the same as a pie chart with slices representing percentages). There are folder tabs at the top of the dialog box (currently, you are on the Assign Variables tab). Click on the Pies tab and you can select how the slices are labeled (it’s nice to have numerical summaries, both counts and percents, to accompany the graph). Click on the Titles tab and give the graph an appropriate title (this is very important!). The pie chart will appear in the Output window. If you double click on the pie chart you will invoke the graph editor. Note you can change things like text, font, slice labels, etc.
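To see why a pie chart of counts looks the same as a pie chart of percents, consider the short Python sketch below. Each slice angle depends only on a category's share of the total, so rescaling counts to percents changes nothing. The counts are hypothetical (not the real DrinkDrive tallies) and slice_angles is a name we made up for this illustration.

```python
def slice_angles(summary):
    """Map each category to its slice angle in degrees."""
    total = sum(summary.values())
    return {cat: 360 * value / total for cat, value in summary.items()}


# Hypothetical counts, not the real DrinkDrive tallies.
counts = {"Yes": 250, "No": 750}
n = sum(counts.values())
percents = {cat: 100 * c / n for cat, c in counts.items()}

angles_from_counts = slice_angles(counts)
angles_from_percents = slice_angles(percents)
print(angles_from_counts)
print(angles_from_percents)
```

Both calls give the same angles, which is why only the slice labels change when you switch the slice summary from count to percent.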

 

To create a bar chart, from the Graphs menu select Legacy Dialogs>Interactive>Bar. In the dialog box drag the DrinkRide variable (“Rode with a high/drunk driver?”) to the box on the horizontal axis. Note the default is for the bar heights to be the counts, but you can have the bar heights be percents by dragging the Percent variable to the box on the vertical axis. From the Bar Chart Options tab you can change the bar labels. Be sure to title your graph (via the Titles folder tab).

 

Next week we’ll look at the relationships between variables in this data set. (For example, do males and females have different drinking habits?)

Description of FacultySalaries.sav

A faculty salary study was done at The Ohio State University to compare faculty salaries with those at other universities. Data were collected from the Association of American Universities. The overall average salary (in thousands of dollars) for OSU was obtained by computing the weighted average of salaries at each faculty level with the weights being the proportion of faculty members in each rank. The overall average salary for each of the other 49 universities was created using the same weights OSU used.
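The weighting described above can be sketched in a few lines of Python. All numbers here are hypothetical (the real proportions and salaries are not given in this handout), and weighted_average is our own helper name. The key point is that every university's overall average uses the same OSU rank proportions as weights.

```python
def weighted_average(values, weights):
    """Weighted mean of values; weights are proportions that sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(values[rank] * weights[rank] for rank in weights)


# Hypothetical proportions of OSU faculty in each rank (illustrative only).
osu_weights = {"full": 0.5, "associate": 0.3, "assistant": 0.2}

# Hypothetical average salaries by rank, in thousands of dollars.
osu_salaries = {"full": 80.0, "associate": 60.0, "assistant": 50.0}
other_salaries = {"full": 90.0, "associate": 55.0, "assistant": 45.0}

osu_overall = weighted_average(osu_salaries, osu_weights)
other_overall = weighted_average(other_salaries, osu_weights)  # same OSU weights
print(osu_overall, other_overall)
```

Using OSU's weights for every school means differences in the overall averages reflect differences in pay at each rank, not differences in how faculty are distributed across ranks.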

 

The dataset contains information on whether a university belongs to the Committee on Institutional Cooperation (a consortium of collaborating research universities), an overall average salary (in thousands), and the average full, associate, and assistant professor salaries (in thousands).

 

Analysis

To make a histogram of the Salary variable, select Legacy Dialogs>Interactive>Histogram from the Graphs menu. In the dialog box, drag the Salary variable to the box on the x-axis. Note that the default is a frequency histogram, but you can create a percent histogram by dragging the Percent variable to the box on the y-axis. Click on the Histogram tab and notice you could change the intervals or overlay a normal curve. Be sure to title your graph (via the Titles folder tab). How would you describe the shape of the average salary distribution? (In class, we might go back to the histogram dialog box to change the number of intervals in order to smooth the distribution.)

 

Suppose you would like separate salary histograms for CIC and non-CIC institutions. Then in the histogram dialog box, drag the CICstatus (Belong to the CIC) variable into the Panel Variables box. This will create the two different histograms and place them on the same scale (note you should choose percent, rather than frequency, histograms since the two groups are of different sizes). What differences or similarities do you notice in the distributions?
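The reason for preferring percent histograms when group sizes differ can be illustrated with a small Python sketch. The salaries below are invented (not the real FacultySalaries.sav values), and percent_histogram is a hypothetical helper that bins values into equal-width intervals and reports the percent of the group in each interval.

```python
def percent_histogram(values, start, width, n_bins):
    """Percent of values in each interval [start + i*width, start + (i+1)*width).

    Values outside the covered range are not counted in any interval.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return [100 * c / len(values) for c in counts]


# Hypothetical overall average salaries (thousands of dollars), not the real data.
cic = [62, 64, 66, 68, 71]                           # 5 schools
non_cic = [52, 55, 57, 58, 61, 63, 64, 66, 67, 72]   # 10 schools

cic_pct = percent_histogram(cic, start=50, width=10, n_bins=3)
non_cic_pct = percent_histogram(non_cic, start=50, width=10, n_bins=3)

for label, pcts in [("CIC", cic_pct), ("non-CIC", non_cic_pct)]:
    print(label)
    for i, p in enumerate(pcts):
        low = 50 + 10 * i
        print(f"  {low}-{low + 10}: {'#' * round(p / 10)} {p:.0f}%")
```

In this made-up example, the 60-70 interval holds 4 CIC schools and 5 non-CIC schools, so the frequency bars would look similar, yet that is 80% of one group and only 50% of the other. Percents put the two panels on a comparable footing.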

 

To get descriptive statistics, select Descriptive Statistics>Frequencies from the Analyze menu (note you can also select Descriptive Statistics>Descriptives, but then you don’t have as many options for the descriptive statistics to be calculated). In the dialog box, highlight the Salary variable and then click the right arrow button to move it to the variables box. Now click on the Statistics button and select the statistics you want (choose the 5-number summary, that is, the minimum, quartiles, median, and maximum, along with the mean and standard deviation). If you click on the Charts button you can also have a histogram created. Finally, click on the box to the left of Display frequency tables so that it is no longer checked. The Salary variable is quantitative, not categorical, so it makes no sense to show the frequencies. Do these descriptive statistics (shown in the Output window) corroborate your description of the salary distribution based on the histogram?
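For reference, the same summaries can be computed by hand or with a short Python sketch like the one below. The salary values are hypothetical, and describe is our own helper name. Note that software packages use slightly different conventions for quartiles, so the quartiles Python reports here are not guaranteed to match the SPSS output exactly.

```python
import statistics


def describe(values):
    """Five-number summary plus mean and sample standard deviation."""
    data = sorted(values)
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4)
    return {
        "min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1],
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),  # sample SD (divides by n - 1)
    }


# Hypothetical average salaries in thousands of dollars, not the real data.
salaries = [52, 55, 58, 60, 61, 63, 66, 70, 85]

summary = describe(salaries)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

In this invented sample the mean is larger than the median, which is the kind of numerical evidence that would agree with a histogram showing a longer right tail.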

 

If you are interested in a first exploration of a variable or variables via graphs and descriptive statistics, then select Descriptive Statistics>Explore from the Analyze menu. Move the FullSal, AssocSal, and AsstSal variables all to the Dependent List box. Click on the Statistics button and select Percentiles as well as Descriptives (otherwise SPSS won’t output the quartiles). Click on the Plots button and select both “Stem-and-Leaf” and “Histogram” (note that this is the only place in SPSS where you can get a stem-and-leaf plot). Also, by default SPSS creates boxplots of these variables. Because of how these data are set up (three separate variables), you should select Dependents together for the boxplots. Through the Explore option, a lot of information is shown in the Output window. The average salaries for the different faculty ranks can be compared via numerical summaries and graphs (note the histograms are not on the same scale, so you need to be careful when making comparisons). Note there are outliers shown in the boxplots and these observations are identified by their row numbers, so you can easily look up which universities they represent. How do the salary distributions compare between faculty ranks? (Use the boxplots, histograms, and numerical summaries to answer this question.)

 

If you want descriptive statistics for a quantitative variable, but separate statistics based on a second, categorical variable, then you must use the Descriptive Statistics>Explore option. For example, move the Salary variable to the Dependent List (quantitative variable) and then move the CICstatus to the Factor List (categorical variable). Note from the Plots button you should select Factor levels together for the boxplots.

 

To create boxplots directly, select Legacy Dialogs>Interactive>Boxplot from the Graphs menu. From this dialog box, you can create a single boxplot (one quantitative variable on the vertical axis and no variables on the horizontal axis) or comparative boxplots (one quantitative variable on the vertical axis and a categorical variable on the horizontal axis). To create comparative boxplots of average salaries for full professors at CIC institutions versus non-CIC institutions, drag the FullSal variable (the quantitative variable) to the vertical axis and the CICstatus variable (the categorical variable) to the horizontal axis. Just to the left of the “2-D Coordinate” button is a button that will turn the boxplots horizontal (rather than vertical). Click this button if you prefer horizontal boxplots. If you click on the Boxes tab, you can change the display. Be sure to title your graph (via the Titles folder tab). Notice the boxplot for the average salary of full professors at CIC institutions shows two flagged outliers. Sometimes it’s useful to investigate the outliers, and therefore it’s nice to know where those particular values are in the spreadsheet. (Note: SPSS labels points as outliers and extremes. Outliers are found according to the rule we discussed in class; extremes are found by multiplying the IQR by 3 and then following the same process.) Double-click on the boxplot graph (which invokes the editor). Now right-click on a suspected outlier (marked with a circle), and choose Show Data Labels from the drop-down menu. A new dialog box will appear. By default, SPSS will show the CIC-status for the outliers (which is not meaningful). Move the Case number up to be displayed and get rid of the CIC-status. Now each outlier is marked by its row in the spreadsheet, so it can easily be investigated.
(Note: You can also mark the outliers by the name of the institution, but you must go back to the boxplot dialog box and drag the University variable to the Label cases by box.)
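The outlier and extreme labels can be mimicked with the Python sketch below. It assumes the rule discussed in class is the usual 1.5 x IQR rule (that multiplier is our assumption; only the multiplier of 3 for extremes is stated above). The salaries are hypothetical, flag_boxplot_points is a made-up helper name, and the quartiles Python computes may differ slightly from the ones SPSS uses for its boxplots.

```python
import statistics


def flag_boxplot_points(values, outlier_mult=1.5, extreme_mult=3.0):
    """Return {case_number: "outlier" or "extreme"} using IQR-based fences.

    outlier_mult=1.5 is an assumed value for the in-class rule.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    flags = {}
    # Case numbers start at 1, like the rows of the SPSS spreadsheet.
    for case_number, v in enumerate(values, start=1):
        distance = max(q1 - v, v - q3)  # how far the value lies outside the box
        if distance > extreme_mult * iqr:
            flags[case_number] = "extreme"
        elif distance > outlier_mult * iqr:
            flags[case_number] = "outlier"
    return flags


# Hypothetical full professor salaries (thousands), listed in spreadsheet order.
full_sal = [84, 130, 80, 88, 81, 104, 85, 78, 90, 83, 86]

flags = flag_boxplot_points(full_sal)
print(flags)
```

The keys of the result play the same role as the case numbers shown on the SPSS boxplot: they tell you which rows of the spreadsheet to look up.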

 

Description of Oil.sav

This data file includes the annual world crude oil production in millions of barrels, 1880-1984.

 

Analysis

Time is obviously a potential factor related to oil production, so we should initially consider time in our analysis (if, after the analysis, it appears that time has no impact on oil production, then we can ignore it and use a simple one-variable graph, like a histogram). To create a time series plot of the oil production variable, select Legacy Dialogs>Interactive>Line from the Graphs menu. In the dialog box, move Oil to the vertical axis and Year to the horizontal axis. Now go to the Dots and Lines folder tab, and check the box for Display: Dots (this will show the data points as dots; otherwise SPSS draws a smooth curve that doesn’t give the whole picture). Finally, title the graph through the Labels button. Does it appear that time is a factor in oil production? (In class, if we have time, we’ll discuss how to make the time series plot fill the entire space of the graph.)

 

 

Additional Notes