Math 117 Computer Lab

Relationships Between Variables: Scatterplots, Correlation, and Regression

 

Getting the Needed Files

Double-click on the My Computer icon on the desktop. Then double-click on the campus_share on 'curtis' (U:) drive and then the Class_Share folder. Finally, double-click on the Math folder and then the math_117 folder. This folder contains (among other files) the PASW files we will use in today’s lab: BrainSize.sav, BodyMeasurements.sav, HousePrices.sav, and AlcoholMetabolism.sav.

 

As a class, we cannot all open these shared files at once (only one person can access them at a time). Thus, you each need to copy the four files to your personal account. You can do this by highlighting all the files (click on the first one, then Ctrl-click on the others; this should highlight them all). Then press Ctrl-C to copy the files. (We used BrainSize.sav last week, so if it’s still in your personal account, you don’t need to recopy it.) Now open the My Documents folder on the desktop (this is the My Documents folder of your personal account). Once you are in the My Documents folder, press Ctrl-V to paste the four files into your account.

 

Now open the statistical software (from the Start menu select Programs>Class Programs and then PASW Statistics 18.0). From the File menu select Open>Data, then change the folder to My Documents and open BrainSize.sav. (Alternatively, you can simply double-click on the BrainSize file in your account and PASW Statistics will automatically open.)

 

Description of BrainSize.sav

This data file contains information on IQ scores, brain size (based on total pixel count of an MRI), sex, weight, and height for 40 college students (the students are all Caucasian and right-handed and they attend a large southwestern university). Look at the Variable-View to see the detailed variable descriptions.

 

Analysis

Recall last week we considered the distribution of student heights, separately for males and females (via side-by-side boxplots, histograms, and numerical summaries).

 

Now consider the relationship between the quantitative variables in this data set. To analyze one of these relationships, the first step is to create a scatterplot. Let’s first consider the relationship between the brain size (MRI count) and the weight of these students, with weight as the explanatory variable and brain size as the response. Again, use the Graphs>Chart Builder option. (Quick notes: Select basic scatterplot from the gallery; choose Brain size for the y-axis; choose Weight for the x-axis; title your graph appropriately.) How would you describe the relationship between these variables? Estimate the correlation for these data. To find the actual correlation coefficient, use the Analyze>Correlate>Bivariate option. In the dialog box, select weight and brain size as the variables (note the default correlation is “Pearson,” which is the correlation we discussed in class). How does the correlation compare with your estimate?
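For reference, the Pearson correlation that SPSS reports can also be computed by hand. The following is a minimal Python sketch of that formula, not part of the SPSS workflow. The function name pearson_r and the five data pairs are made up for illustration; they are not values from BrainSize.sav.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values for illustration only; NOT taken from BrainSize.sav.
weight = [120, 135, 150, 165, 180]       # explanatory variable
mri_count = [810, 850, 900, 880, 960]    # response variable

r = pearson_r(weight, mri_count)
print("Pearson r =", round(r, 3))
```

A value of r near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one, so compare the number with what you see in the scatterplot.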

 

When considering weight and brain size, should the sex variable be taken into account? To create separate scatterplots for males and females, use the Graphs>Chart Builder option. There are two ways to create these scatterplots: 1) side-by-side (use the Groups>Column panel variable option) or 2) on the same graph (choose the grouped scatter option from the gallery). We’ll discuss both of these options in lab. Does it seem the relationship between brain size and body weight is different for males and females? How so?

 

We’ll investigate this difference more, but first let’s consider a different research question: can brain size predict IQ? Create a scatterplot to address this question (you can leave separate colors for males and females). Is there a relationship between brain size and IQ (either overall or separately for males and females)? What is an odd feature of this particular scatterplot?

 

Back to the body weight as a predictor of brain size (which seems much more fruitful than brain size predicting IQ). We know this relationship is different for males and females. If we want to do detailed analyses (rather than simply scatterplots with different symbols), we can analyze the data separately by sex. To do this, in the data window, select Data>Split File. Click on the “Organize output by groups” circle and move the sex variable to the box under “Groups based on:” Now any analyses you choose will be done separately by sex (note that in order to analyze the data again as a whole, you need to go back to the Split File option and select “Analyze all cases, do not create groups”).

 

We noticed the linear relationship (between body weight and brain size) is stronger for females than males. Since the data are now split by sex, we can determine separate correlations (we couldn’t do this with our original data setup). Go back to Analyze>Correlate>Bivariate, and select body weight and brain size as your variables. In the output window, note you get separate correlations for males and females. Do these agree with our previous scatterplot? Does the difference between males and females surprise you? Or do you have an explanation?
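The idea behind Split File (compute the same statistic once per group) can be sketched in Python as follows. This is only an illustration under stated assumptions: the rows are made up and are not from BrainSize.sav, and the names rows, groups, and by_sex are hypothetical.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up rows (sex, weight, MRI count) for illustration only.
rows = [
    ("Female", 110, 800), ("Female", 125, 830),
    ("Female", 140, 870), ("Female", 155, 890),
    ("Male", 150, 930), ("Male", 165, 900),
    ("Male", 180, 960), ("Male", 195, 940),
]

# "Split the file": collect the (weight, MRI) pairs separately for each sex.
groups = defaultdict(list)
for sex, weight, mri in rows:
    groups[sex].append((weight, mri))

# Compute one correlation per group.
by_sex = {}
for sex, pairs in groups.items():
    weights = [w for w, _ in pairs]
    mris = [m for _, m in pairs]
    by_sex[sex] = pearson_r(weights, mris)
    print(sex, round(by_sex[sex], 3))
```

In this made-up example the female correlation comes out larger than the male correlation, which mirrors the kind of comparison you are asked to make from the SPSS output.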

 

Description of HousePrices.sav

This data file contains information (selling price—in hundreds of dollars—and square feet of living space) on 102 houses sold in Albuquerque, New Mexico in 1993.

 

Analysis

What is the relationship between selling price (in $100) and square feet of living space? For example, can we use square footage to predict selling price? If so, this would be helpful information for realtors and home-owners in Albuquerque. Create a scatterplot of these variables. The relationship is positive and quite strong. Hence, it seems reasonable to use regression to explain and predict selling price based on square footage.

 

There are two ways to do simple regression in SPSS. One method simply shows a scatterplot with the regression line, which is a nice visual; the other method provides more detailed (and necessary) information about the regression model. Double-click on the scatterplot you just created (to invoke the Chart Editor). From the editor choose Elements>Fit Line at Total. SPSS then creates a scatterplot with the regression line drawn in and gives the R² value (it used to provide the equation of the regression line, but this new version of SPSS does not). Note we have the visual of a regression line, which is nice, but we don’t have the equation of the line (and we aren’t able to analyze the residuals). Hence, from this scatterplot-with-regression-line, we aren’t able to explain or predict selling price and we aren’t able to assess our model via a residual plot.

 

For a more detailed regression analysis, go to the Analyze menu and select Regression>Linear. In the dialog box, select selling price as the dependent variable (i.e., response variable) and square feet as the independent variable (i.e., explanatory variable). Then click on the Save button and in the new dialog box select Residuals: Unstandardized (these are the residuals we discussed in class). Recall it is important to look at a residual plot for each regression (if the residual plot shows any type of pattern, this indicates the regression model is somehow inadequate). Saving the residuals allows us to create this residual plot.

 

The regression output is shown in the output window (we’ll discuss this in detail in lab). There is a lot of output, but one important take-away is the equation of the regression line. In the “Coefficients” output table, there is a column labeled “B.” The first value in that column is the numerical value of the y-intercept of the regression line (in this case, the value is 129.915); the second value in that column is the numerical value of the slope of the regression line (in this case, the value is 0.541). Hence, the equation of the regression line is predicted selling price = 129.915 + 0.541 × (square feet), and we can interpret the slope value in the context of the problem: As the size of a house increases by 1 square foot, the predicted selling price increases by 0.541 hundred dollars (i.e., $54.10). Or, more informatively, as the size of a house increases by 100 square feet, the predicted selling price increases by 54.1 hundred dollars (i.e., $5,410).
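To see how the fitted line is used, here is a short Python sketch that plugs a house size into the equation. The intercept and slope are the values reported above; the 1,500-square-foot house is a hypothetical example (assumed, for illustration, to fall within the range of the data), and the function names are made up.

```python
INTERCEPT = 129.915  # y-intercept from the "B" column (hundreds of dollars)
SLOPE = 0.541        # slope from the "B" column (hundreds of dollars per square foot)

def predicted_price_hundreds(square_feet):
    """Predicted selling price, in hundreds of dollars, from the fitted line."""
    return INTERCEPT + SLOPE * square_feet

def to_dollars(price_hundreds):
    """Convert a price in hundreds of dollars to dollars."""
    return price_hundreds * 100

sqft = 1500  # hypothetical house size, for illustration only
price = predicted_price_hundreds(sqft)
print("Predicted price: $%.2f" % to_dollars(price))

# Slope interpretation: change in predicted price for 100 extra square feet.
change = predicted_price_hundreds(sqft + 100) - predicted_price_hundreds(sqft)
print("Extra 100 sq ft adds about $%.2f" % to_dollars(change))
```

The second calculation reproduces the slope interpretation: an additional 100 square feet changes the prediction by 100 × 0.541 = 54.1 hundred dollars, no matter which starting size is used.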

 

We have two diagnostics to assess our model: R² and the residual plot. Recall our interpretation of R² (in the context of this problem): 73.5% of the variation in selling prices is explained by our regression line. This is fairly high (especially based on a single predictor variable). Now consider the residuals, which are saved in column 3. To create the basic residual plot (residuals versus explanatory variable), go to Graphs>Chart Builder and make a scatterplot with Unstandardized Residuals on the y-axis and square feet on the x-axis. It’s nice to have a horizontal line at 0 within the plot, as this is a visual reference for a perfect fit by the regression line. To add a reference line to a plot, first double-click on the graph to invoke the Chart Editor. Then select Options>Add Y-Axis Reference Line. In the new dialog box, change the “Position” to 0, and then click the Apply button. Does the residual plot show a random scatter of points? Or a pattern? (Remember, don’t read too much into these plots; think big picture.)
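The two diagnostics can be illustrated with a small Python sketch that fits a least-squares line, computes the residuals (observed minus predicted), and computes R² as 1 minus (residual sum of squares / total sum of squares). The data below are made up for illustration and are not from HousePrices.sav; fit_line is a hypothetical helper, not an SPSS feature.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up values for illustration only; NOT taken from HousePrices.sav.
square_feet = [1000, 1500, 2000, 2500, 3000]
price = [650, 950, 1200, 1500, 1750]  # hundreds of dollars

intercept, slope = fit_line(square_feet, price)

# Residual = observed price minus price predicted by the line.
residuals = [y - (intercept + slope * x) for x, y in zip(square_feet, price)]

# R^2 = proportion of the variation in price explained by the line.
mean_price = sum(price) / len(price)
ss_resid = sum(e ** 2 for e in residuals)
ss_total = sum((y - mean_price) ** 2 for y in price)
r_squared = 1 - ss_resid / ss_total

print("intercept:", round(intercept, 3), "slope:", round(slope, 3))
print("residuals:", [round(e, 1) for e in residuals])
print("R^2:", round(r_squared, 4))
```

Plotting these residuals against square feet, with a horizontal line at 0, gives the same kind of residual plot described above. Least-squares residuals always sum to zero (up to rounding), so the points scatter above and below that line.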

 

Based on our diagnostics, a line is a reasonable summary of the relationship in the data (based on the residual plot), and our line explains 73.5% of the variation in selling price. So a realtor in Albuquerque should feel good using this line to predict selling price (within the range of our data) and to explain the impact of each additional square foot of living space. If the realtor wants an even better model, they can include additional variables (multiple regression). Thoughts on other variables that might be good predictors of selling price?

 

Description of AlcoholMetabolism.sav

Case study from The Statistical Sleuth, Second Edition, by Ramsey and Schafer, Duxbury Publishers:

“Women exhibit a lower tolerance for alcohol and develop alcohol-related liver disease more readily than men. When men and women of the same size and drinking history consume equal amounts of alcohol, the women on average carry a higher concentration of alcohol in their bloodstream. According to a team of Italian researchers, this occurs because alcohol-degrading enzymes in the stomach (where alcohol is partially metabolized before it enters the bloodstream and is eventually metabolized by the liver) are more active in men than in women. The researchers studied the extent to which the activity of the enzyme explained the first-pass alcohol metabolism and the extent to which it explained the differences in first-pass metabolism between women and men. This data file includes their data (M. Frezza, et al. (1990), “High Blood Alcohol Levels in Women,” New England Journal of Medicine, 322, pp. 95-99).”

 

“The subjects were 18 women and 14 men, all volunteers living in Trieste. All subjects received ethanol, at a dose of 0.3 grams per kilogram of body weights, orally one day and intravenously another, in randomly determined order. Since the intravenous administration bypasses the stomach, the difference in blood alcohol concentration—the concentration after intravenous administration minus the concentration after oral administration—provides a measure of the ‘first-pass metabolism’ (measured in millimol/liter-hour) in the stomach. In addition, gastric alcohol dehydrogenase (AD) activity (activity of the key enzyme) was measured (in micromol/min/g of tissue) in mucus samples taken from the stomach linings of the subjects.”

 

Analysis

You need to think about how to analyze these data appropriately, both graphically and numerically, using SPSS (and your mind). Consider the following questions of interest. To answer each question, perform an appropriate and complete analysis in SPSS. Be prepared to share your results with the whole class. And feel free to ask questions while you work.

 

  • How would you describe the distribution of first-pass metabolism for this sample? [Hints: First-pass metabolism is quantitative, so a histogram or boxplot can be used, and appropriate numerical summaries can also be provided (Analyze>Descriptive Statistics>Frequencies).]

 

  • Is the first-pass metabolism different for females and males? How so? [Hints: Side-by-side boxplots or histograms can be used—separating by sex. Separate descriptive statistics can be found using the Analyze>Descriptive Statistics>Explore option with sex in the factor list.]

 

  • How would you describe the distribution of gastric AD activity for this sample?

 

  • Is the gastric AD activity different for females and males? How so?

 

  • Is there a relationship between first-pass metabolism and gastric AD activity? If so, what is the nature of the relationship? [Hints: These are two quantitative variables, so a scatterplot is the starting place. Then, if reasonable, a regression line can be fit within the scatterplot.]

 

  • Is the relationship between first-pass metabolism and gastric AD activity different for females and males? If so, how? [Hints: Use a “grouped scatter” from the Chart Builder. Once the graph is created, you can add, through the Chart Editor, separate regression lines (Elements>Fit Line at Subgroups).] Specifically, how can you interpret the difference in regression-line slopes and the difference in R² values?
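For the distribution questions in the list above, the numerical summaries amount to a five-number summary computed overall or per group. Here is a minimal Python sketch of that idea. The values are made up and are not from AlcoholMetabolism.sav, and the quartile method used here ("inclusive") is an assumption that may differ slightly from the convention SPSS uses.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Made-up first-pass metabolism values for illustration only.
metabolism = {
    "Female": [0.5, 1.0, 1.5, 2.0, 2.5],
    "Male": [1.0, 2.5, 3.0, 4.5, 6.0],
}

summaries = {}
for sex, values in metabolism.items():
    summaries[sex] = five_number_summary(values)
    print(sex, summaries[sex], "mean:", statistics.mean(values))
```

Comparing the two summaries side by side (center, spread, and extremes) is the numerical counterpart of comparing side-by-side boxplots.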