Math 217 Computer Lab – Simple Linear Regression

 

Getting the Needed Files

Double-click the My Computer icon on the desktop. Then double-click the campus_share on 'curtis' (U:) drive and then the Class_Share folder. Finally, double-click the Math folder and then the Math_217 folder. This folder contains (among others) the Minitab files we will use in today’s lab: AlcoholTobaccoSpending.MPJ, AlcoholMetabolism.MPJ, and HousePricesAlbuquerque.MPJ.

 

As a class, we cannot all work from these shared files (only one person can access them at a time). Thus, you each need to copy the three files to your personal account. You can do this by simply highlighting the three files (click on the first one, then Ctrl-click on each of the other two). Then press Ctrl-C to copy the files. Now open the My Documents folder on the desktop (this is the My Documents folder of your personal account). Once you are in the My Documents folder, press Ctrl-V to paste the three files into your account.

 

Now open the Minitab software (from the Start menu select Programs>Class Programs and then Minitab>Minitab15). Then open the first file (HousePricesAlbuquerque.MPJ) in Minitab: go to the File menu and choose Open Project.

 

Description of HousePricesAlbuquerque.MPJ

This data file contains information (selling price and square feet of living space) on 115 homes sold in Albuquerque, New Mexico in 1993.

 

Analysis

We already discussed this example “on paper” in the classroom. Now we’ll investigate it with Minitab. Since we’re interested in the relationship between selling price and square footage, the first step is to create a scatterplot (Graph>Scatterplot>Simple). From this graph, it’s clear that a positive linear relationship exists between these variables.

 

There are multiple ways to perform regression analysis in Minitab. First we’ll use the fitted line plot function. From the Stat menu select Regression>Fitted Line Plot. Choose selling price as the response variable and square feet as the predictor variable. The default is linear regression, which is what we want. Now click on the Graphs button. Note you can select a histogram and normal plot of residuals (to check the normality condition), and you can also select a plot of residuals versus the fitted values (to check the constant-variance condition). But you cannot get a prediction interval for selling price based on a specific square footage value. (Also, this procedure doesn’t give as much regression output in the session window.)
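For those curious about what the fitted line plot computes, here is a minimal Python sketch of a least-squares line and its residuals. The square footage and price values are made up for illustration (they are not the Albuquerque data), and the function name fit_line is our own; Minitab's internal routine may differ in detail.

```python
# Hypothetical values for illustration only (NOT the Albuquerque data).
sqft = [1000, 1200, 1500, 1800, 2000, 2400]   # square feet of living space
price = [80, 95, 110, 135, 150, 170]          # selling price, in $1000s


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = fit_line(sqft, price)

# Fitted values and residuals: these are what the residual plots display.
fitted = [b0 + b1 * xi for xi in sqft]
residuals = [yi - fi for yi, fi in zip(price, fitted)]

for fi, ri in zip(fitted, residuals):
    print(f"fitted = {fi:7.2f}   residual = {ri:6.2f}")
```

Plotting residuals against fitted gives the residuals-versus-fits graph used to check the constant-variance condition.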

 

A more flexible procedure when doing regression is Stat>Regression>Regression. Select selling price as the response and square feet as the predictor (note you can choose more than one predictor—this procedure allows you to perform multiple regression). Click on the Graphs button and you’ll see all the diagnostic graphs from which you can choose. Now click on the Options button. Note you can ask for a prediction interval (or confidence interval) for a new observation. Suppose we want a prediction interval for selling price based on a house with 2000 square feet. Enter the value 2000 in the appropriate box. By default the 95% level is used (but you can change this if desired).

 

Notice there is much more output in the session window than when you used the fitted line plot option. We will discuss this as a class. At the end of the output, Minitab provides both a 95% confidence interval for the average selling price of houses with 2000 square feet and a 95% prediction interval for the selling price of one house with 2000 square feet.
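To see why the prediction interval is always wider than the confidence interval, the following Python sketch computes both at 2000 square feet using the standard simple linear regression formulas. The data are hypothetical, the function name intervals_at is our own, and the t critical value (2.776 for 4 degrees of freedom) is read from a t table for this toy sample size only.

```python
import math

# Hypothetical values for illustration only (NOT the Albuquerque data).
sqft = [1000, 1200, 1500, 1800, 2000, 2400]
price = [80, 95, 110, 135, 150, 170]          # in $1000s


def intervals_at(x, y, x0, t_crit):
    """Return (y_hat, confidence interval, prediction interval) at x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Residual standard deviation s, with n - 2 degrees of freedom.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    y_hat = b0 + b1 * x0
    # Standard error for the MEAN response at x0.
    se_mean = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    # Standard error for ONE new observation at x0 (note the extra "1 +").
    se_pred = s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)

    ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
    pi = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)
    return y_hat, ci, pi


# t critical value for 95% and n - 2 = 4 df, taken from a t table.
y_hat, ci, pi = intervals_at(sqft, price, x0=2000, t_crit=2.776)
print(f"fit = {y_hat:.1f}")
print(f"95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI = ({pi[0]:.1f}, {pi[1]:.1f})")
```

The extra "1 +" under the square root accounts for the variability of an individual house around the mean, which is why the prediction interval is the wider of the two.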

 

The residual plot shows a slight fanning-out pattern. This indicates that the constant-variance condition might not be met. One possible remedy is to transform the response variable; a natural log transformation often works to stabilize the variability. Create a new variable that is the natural log of selling price (Calc>Calculator). Then rerun the regression using log price as the response variable. What do you think: did this transformation help? (Note that you can easily transform predictions of log price back to price by exponentiating.)
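The log transformation and the back-transformation of a prediction can be sketched in Python as follows. Again the data are hypothetical and fit_line is our own helper, so treat this as an illustration of the idea rather than a reproduction of the Minitab output.

```python
import math

# Hypothetical values for illustration only (NOT the Albuquerque data).
sqft = [1000, 1200, 1500, 1800, 2000, 2400]
price = [80, 95, 110, 135, 150, 170]          # in $1000s


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Step 1: create the new response, the natural log of selling price.
log_price = [math.log(p) for p in price]

# Step 2: rerun the regression with log price as the response.
b0, b1 = fit_line(sqft, log_price)

# Step 3: predict log price at 2000 square feet, then back-transform.
pred_log = b0 + b1 * 2000
pred_price = math.exp(pred_log)

print(f"predicted log price = {pred_log:.3f}")
print(f"back-transformed price = {pred_price:.1f} (in $1000s)")
```

The endpoints of an interval for log price can be exponentiated in the same way to give an interval on the original price scale.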

 

Description of AlcoholTobaccoSpending.MPJ

This data file comes from the 1981 records of the Department of Employment (British official statistics). It shows the average weekly household spending, in British pounds, on tobacco products and alcoholic beverages for each of the 11 regions of Great Britain.  

 

Analysis

Suppose we’re interested in the relationship between alcohol and tobacco spending (specifically, if tobacco spending helps predict alcohol spending). Create a scatterplot of these two variables. What is the most obvious feature of this graph? Which region is the outlier? (Recall you can use the Editor>Brush function to identify the row of specific points. Furthermore, you can label all the points in a scatterplot based on a third, categorical variable—this will be helpful for your homework. Simply right-click on the scatterplot and choose Add>Data Labels. Then click on the “Use labels from column” circle and select the Region variable.)

 

We can now investigate the impact and influence of this outlier on the regression analysis. Highlight the second and third columns of the worksheet and copy them. Then paste them into the next two open columns. Notice that Minitab gives them a default title. Change the titles appropriately (e.g., “Alcohol Spending No N. Ireland”). Then delete the values from Northern Ireland (in the new columns), but initially leave an “*” in each spot (this is so the columns will all be of the same length and we can use the multiple-graph scatterplot option).

 

Now from the Graph menu choose Scatterplot>With Regression. Enter two Y and two X variables (your original data and the data with Northern Ireland deleted). Then click on the Multiple Graphs button and choose “overlaid on the same graph.” This graph shows how much the regression line changes when the Northern Ireland data point is removed. In this case, it’s probably best to do two separate analyses and provide both sets of results. To investigate this further (note the scatterplot option doesn’t give you the details of the regression analysis—only the regression line equation), you can perform regression analysis separately for the two sets of variables. Do these separate analyses. What changes do you notice? How does R-squared change? How does the line change? How do the residuals change? How does the inference on the population slope change? What would you tell a client if he/she asked you to do a complete regression analysis of these data?
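The effect of a single outlier on the slope and on R-squared can be illustrated with a short Python sketch. The numbers below are invented (they are not the British spending data), None plays the role of Minitab's "*" missing-value marker, and fit_summary is a hypothetical helper of our own.

```python
# Hypothetical values for illustration only (NOT the British spending data).
tobacco = [2.5, 3.0, 3.5, 4.0, 4.5]
alcohol = [4.5, 5.0, 5.6, 6.1, 4.0]           # the last point is the outlier

# Copies with the outlier replaced by a missing marker, so all columns
# keep the same length (None stands in for Minitab's "*").
tobacco_no = tobacco[:-1] + [None]
alcohol_no = alcohol[:-1] + [None]


def fit_summary(x, y):
    """Return (intercept, slope, r_squared), skipping missing pairs."""
    pairs = [(xi, yi) for xi, yi in zip(x, y)
             if xi is not None and yi is not None]
    n = len(pairs)
    x_bar = sum(xi for xi, _ in pairs) / n
    y_bar = sum(yi for _, yi in pairs) / n
    sxx = sum((xi - x_bar) ** 2 for xi, _ in pairs)
    syy = sum((yi - y_bar) ** 2 for _, yi in pairs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


_, slope_with, r2_with = fit_summary(tobacco, alcohol)
_, slope_without, r2_without = fit_summary(tobacco_no, alcohol_no)

print(f"with outlier:    slope = {slope_with:.3f}, R-sq = {r2_with:.3f}")
print(f"without outlier: slope = {slope_without:.3f}, R-sq = {r2_without:.3f}")
```

In this toy example one point is enough to flatten the line almost completely, which is the kind of change to look for when you compare your two Minitab analyses.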

 

Description of AlcoholMetabolism.MPJ

Case study from The Statistical Sleuth, Second Edition, by Ramsey and Schafer, Duxbury Publishers:

 

“Women exhibit a lower tolerance for alcohol and develop alcohol-related liver disease more readily than men. When men and women of the same size and drinking history consume equal amounts of alcohol, the women on average carry a higher concentration of alcohol in their bloodstream. According to a team of Italian researchers, this occurs because alcohol-degrading enzymes in the stomach (where alcohol is partially metabolized before it enters the bloodstream and is eventually metabolized by the liver) are more active in men than in women. The researchers studied the extent to which the activity of the enzyme explained the first-pass alcohol metabolism and the extent to which it explained the differences in first-pass metabolism between women and men. This data file includes their data (M. Frezza, et al. (1990), “High Blood Alcohol Levels in Women,” New England Journal of Medicine, 322, pp. 95-99).”

 

“The subjects were 18 women and 14 men, all volunteers living in Trieste. All subjects received ethanol, at a dose of 0.3 grams per kilogram of body weight, orally one day and intravenously another, in randomly determined order. Since the intravenous administration bypasses the stomach, the difference in blood alcohol concentration—the concentration after intravenous administration minus the concentration after oral administration—provides a measure of the ‘first-pass metabolism’ (measured in mmol/liter-hour) in the stomach. In addition, gastric alcohol dehydrogenase (AD) activity (activity of the key enzyme) was measured (in μmol/min/g of tissue) in mucus samples taken from the stomach linings of the subjects.”

 

Analysis

Does gastric AD activity help predict the first-pass metabolism? And is this prediction different for males and females? First create a scatterplot of first-pass metabolism versus gastric AD activity. There seems to be a positive, linear relationship between these two variables. Now go back to the scatterplot option and choose With Groups (add Sex as the categorical variable). Do the patterns seem to be different for males and females? Go back to the scatterplot option and choose With Regression and Groups. What do you notice in this graph? Just based on this graph, what can we roughly say about the magnitude of the impact of gastric AD activity on first-pass metabolism (think about the slopes)?

 

The scatterplot option doesn’t allow you to do regression in detail. Suppose you want to analyze the data separately for males and females. Then from the Data menu select Split Worksheet. In the dialog box, select the Sex variable as the “By variable”. Minitab will then create two new worksheets separating the data by sex (note that it will also keep the original worksheet intact). The highlighted worksheet will be the active worksheet and it’s the active worksheet that Minitab will work with. It’s important that you label all your graphs appropriately (males or females) so you can keep track of your analysis (Minitab doesn’t do this for you).

Do a detailed regression analysis for both males and females. What differences or similarities do you notice?
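Splitting the data by sex and fitting a separate line to each group can be sketched in Python as below. The rows are made-up values (not the Frezza et al. data), and the names rows, groups, and fit_line are our own; the sketch only mirrors the idea behind Split Worksheet followed by a regression on each new worksheet.

```python
# Hypothetical (sex, gastric AD activity, first-pass metabolism) rows,
# for illustration only (NOT the study data).
rows = [
    ("Female", 1.0, 0.6), ("Female", 1.5, 1.0),
    ("Female", 2.0, 1.9), ("Female", 2.5, 2.2),
    ("Male", 1.5, 1.5), ("Male", 2.0, 3.0),
    ("Male", 2.5, 4.8), ("Male", 3.0, 5.9),
]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# "Split worksheet": collect the (AD, metabolism) pairs for each sex.
groups = {}
for sex, ad, fpm in rows:
    groups.setdefault(sex, []).append((ad, fpm))

# Fit one regression line per group and label each result by sex.
fits = {}
for sex, pts in groups.items():
    ad_values = [a for a, _ in pts]
    fpm_values = [f for _, f in pts]
    fits[sex] = fit_line(ad_values, fpm_values)

for sex, (b0, b1) in fits.items():
    print(f"{sex}: metabolism = {b0:.2f} + {b1:.2f} * AD activity")
```

Keeping the results in a dictionary keyed by sex serves the same purpose as labeling your Minitab graphs: each fit stays attached to the group it came from.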