Math 445 Computer Lab: Regression Analysis Using Minitab

 

Getting the Needed Files

Double-click the My Computer icon on the desktop. Then double-click the campus_share on 'curtis' (U:) drive and open the Class_Share folder. Finally, open the Math folder and then the math_445 folder. This folder contains (among others) the Minitab files we will use in today’s lab: Cars.MPJ, RealEstateSales.MPJ, and Senility.MPJ. Copy these three files into your own account.

 

Description of Cars.MPJ

This data file contains information on 93 different cars sold in 1993. Can a car’s weight be used to accurately predict its highway miles per gallon (MPG)?

 

Analysis

Since we’re interested in the relationship between MPG and weight, the first step is to create a scatterplot (Graph>Scatterplot>Simple). From this graph, it’s clear that a negative linear relationship exists between these variables.

 

There are multiple ways to perform regression analysis in Minitab. First we’ll use the fitted line plot function. From the Stat menu select Regression>Fitted Line Plot. Choose highway MPG as the response variable and weight as the predictor variable. The default is linear regression, which is what we want. Now click on the Graphs button. Note you can select a histogram and normal plot of residuals (to check the normality condition), and you can also select a plot of residuals versus the fitted values (to check the constant-variance condition). But you cannot get a prediction interval for highway MPG based on a specific weight. (Also, this procedure doesn’t give as much regression output in the session window.)

 

A more flexible option when doing regression is Stat>Regression>Regression. Select highway MPG as the response and weight as the predictor (note you can choose more than one predictor—this procedure allows you to perform multiple regression). Click on the Graphs button and you’ll see all the diagnostic graphs from which you can choose. Now click on the Options button. Note you can ask for a prediction interval (or confidence interval) for a new observation. Suppose we want a prediction interval for highway MPG based on a car that weighs 3000 pounds. Enter the value 3000 in the appropriate box. By default the 95% level is used (but you can change this if desired).

 

Notice that the session window contains much more output than the fitted line plot option produced. We will discuss this as a class. At the end of the output, Minitab provides both a 95% confidence interval for the average highway MPG of cars that weigh 3000 pounds and a 95% prediction interval for the highway MPG of one car that weighs 3000 pounds. (Note that the prediction interval is clearly wider. Also, the confidence interval for the average MPG is quite narrow because 3000 pounds is very close to the average weight of all the cars.)
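To see why the prediction interval is wider than the confidence interval, the two interval formulas can be sketched in Python. This is only an illustration under stated assumptions, not Minitab’s code: the five (weight, MPG) pairs are made up rather than taken from Cars.MPJ, the function name ci_and_pi is hypothetical, and the critical value 3.182 (95% level, 3 degrees of freedom) is read from a t table.

```python
import math

def ci_and_pi(x, y, x_new, t_crit):
    """Fit y = b0 + b1*x by least squares and return the fitted value at
    x_new plus the half-widths of the confidence and prediction intervals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))          # residual standard deviation
    fit = b0 + b1 * x_new
    # Grows as x_new moves away from the mean of x
    spread = 1 / n + (x_new - x_bar) ** 2 / sxx
    ci_half = t_crit * s * math.sqrt(spread)       # mean response
    pi_half = t_crit * s * math.sqrt(1 + spread)   # one new observation
    return fit, ci_half, pi_half

# Made-up illustration data (NOT the Cars.MPJ values)
weights = [2000, 2500, 3000, 3500, 4000]
mpg = [37, 33, 29, 27, 23]
T_CRIT = 3.182  # t table value: 95% level, n - 2 = 3 degrees of freedom

fit, ci_half, pi_half = ci_and_pi(weights, mpg, 3000, T_CRIT)
print(f"fit = {fit:.2f}")
print(f"95% CI: {fit - ci_half:.2f} to {fit + ci_half:.2f}")
print(f"95% PI: {fit - pi_half:.2f} to {fit + pi_half:.2f}")
```

The extra 1 under the square root in pi_half accounts for the variability of a single car around the line, so the prediction interval is always the wider of the two. The (x_new - x_bar) term is why the confidence interval is narrowest near the average weight.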

 

The residual plot shows a slight fanning-out pattern. This indicates that the constant-variance condition might not be met. One possible remedy is to transform the response variable; a natural log transformation often stabilizes the variability. Create a new variable that is the natural log of highway MPG (Calc>Calculator). Then rerun the regression using log MPG as the response variable. What do you think: did this transformation help? (Note that you can easily transform predictions of log MPG back to MPG by exponentiating them.)
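The back-transformation can be sketched as follows. This is a minimal Python illustration, not part of the Minitab procedure: the data are made up (not the Cars.MPJ values) and the helper name fit_line is hypothetical.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Made-up illustration data (NOT the Cars.MPJ values)
weights = [2000, 2500, 3000, 3500, 4000]
mpg = [37, 33, 29, 27, 23]

# Step 1: transform the response (what Calc>Calculator does in Minitab)
log_mpg = [math.log(v) for v in mpg]

# Step 2: regress log MPG on weight
b0, b1 = fit_line(weights, log_mpg)

# Step 3: predict on the log scale, then exponentiate to get back to MPG
pred_log = b0 + b1 * 3000
pred_mpg = math.exp(pred_log)

# On the original scale the slope acts multiplicatively:
# each extra 100 pounds multiplies predicted MPG by this factor.
factor_per_100_lb = math.exp(100 * b1)

print(f"predicted log MPG = {pred_log:.4f}, predicted MPG = {pred_mpg:.2f}")
print(f"factor per 100 lb = {factor_per_100_lb:.4f}")
```

The endpoints of a prediction interval for log MPG can be exponentiated in the same way to give a prediction interval for MPG.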

 

 

Logistic Regression

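Suppose our response variable (the variable we want to predict) is a binary variable $Y$ that takes the value 1 if the event occurs and 0 otherwise. Then

$$E(Y) = P(Y = 1) = \pi.$$

If we use the simple linear model here (i.e., $\pi(x) = \beta_0 + \beta_1 x$), there is one obvious drawback: the model might predict a value of $\pi(x)$ outside the range (0, 1).

One remedy is to model $\pi(x)$ using a logit function:

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

(the logit function assumes an S-shaped curve on the probabilities of occurrence; this makes sense in some cases, but not others).

Then

$$1 - \pi(x) = \frac{1}{1 + e^{\beta_0 + \beta_1 x}}.$$

The odds of occurrence at a particular value of $x$ are defined as

$$\frac{\pi(x)}{1 - \pi(x)} = e^{\beta_0 + \beta_1 x}.$$

Hence,

$$\ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x.$$

That is, we model the log-odds with a linear model. This is a more complicated model to fit, but we can use Minitab to do the grungy work.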
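A short Python sketch can confirm the two properties of the logit model described above: the modeled probability always stays inside (0, 1), and its log-odds are linear in x. The coefficient values and function names here are hypothetical, chosen only for illustration.

```python
import math

def event_probability(x, b0, b1):
    """Logit model: exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    eta = b0 + b1 * x
    return math.exp(eta) / (1 + math.exp(eta))

def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

# Hypothetical coefficients, for illustration only
B0, B1 = 1.5, -0.4

xs = [-20, 0, 5, 20]
for x in xs:
    linear = B0 + B1 * x                 # a straight-line "probability" can leave (0, 1)
    p = event_probability(x, B0, B1)     # the logit model never does
    print(f"x = {x:4d}  linear = {linear:6.2f}  "
          f"logit prob = {p:.5f}  log-odds = {log_odds(p):6.2f}")
```

In each row the log-odds column reproduces the straight-line value, which is exactly the statement that the log-odds follow a linear model.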

 

Description of Senility.MPJ

A sample of elderly people was given a psychiatric examination to determine whether symptoms of senility were present (senility refers to a decline or deterioration of physical strength or mental functioning, especially as a result of old age or disease). One possible predictor variable of senility is the score on a subtest of the Wechsler Adult Intelligence Scale (WAIS). This data file includes the senility value (1 = symptoms present, 0 = no symptoms present) and the WAIS subscale score for the 54 people in the study.

 

Analysis

First look descriptively at the data. Create comparative boxplots and numerical summaries. What do you notice? We want to predict senility based on the Wechsler Adult Intelligence Scale score. In this case, the response variable is binary. Hence, we should use logistic regression. From the Stat menu select Regression>Binary Logistic Regression. The response variable is senility and the model simply includes WAIS score (note you can include multiple predictor variables in the model; that is, you can do multiple logistic regression). From the Graphs button, select “Delta chi-square versus probability.” From the Storage button, select “Event Probability.” You can look at the Options and Results buttons, but we’ll leave all the default selections.

 

Now consider the output in the session window. First look at the “Test that all slopes are zero” output. Under the null hypothesis, this G statistic approximately follows a chi-square distribution with degrees of freedom equal to the number of predictors. Clearly, this shows a significant slope value (p-value = 0.001). The logistic regression table gives estimated coefficients, standard errors, and significance tests (based on the z-distribution). The WAIS score is clearly a significant predictor (p-value = 0.005). Note it also gives the odds ratio, which is simply $e^{b_1}$, where $b_1$ is the estimated coefficient for WAIS score. We can give an interpretation of this value: for each 1-unit increase in WAIS score, the odds of senility are multiplied by 0.72 (that is, they decrease by about 28%). Also, for a score of 10 on the WAIS test, the predicted probability of senility is $\hat{\pi}(10) = e^{b_0 + 10 b_1} / (1 + e^{b_0 + 10 b_1})$, where $b_0$ is the estimated constant from the logistic regression table. Or, for a score of 8 on the WAIS test, the predicted probability of senility is $\hat{\pi}(8) = e^{b_0 + 8 b_1} / (1 + e^{b_0 + 8 b_1})$. Of course, our trust in these predictions depends on how good our model is.
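The odds-ratio interpretation can be checked with a few lines of Python. This sketch uses the rounded odds ratio of 0.72 quoted above; the intercept is a made-up placeholder, not Minitab’s estimate, and the function names are hypothetical. The point is that the ratio of odds does not depend on the intercept at all.

```python
import math

ODDS_RATIO = 0.72                  # rounded value quoted in the lab text
b1 = math.log(ODDS_RATIO)          # implied slope on the log-odds scale

def odds(x, b0, b1):
    """Odds of the event at x under the logit model: exp(b0 + b1*x)."""
    return math.exp(b0 + b1 * x)

def probability(x, b0, b1):
    """Convert the odds at x into a probability: odds / (1 + odds)."""
    o = odds(x, b0, b1)
    return o / (1 + o)

B0_PLACEHOLDER = 2.0               # made-up intercept, NOT Minitab's estimate

# One-unit increase in WAIS score: odds are multiplied by the odds ratio
ratio_one_unit = odds(11, B0_PLACEHOLDER, b1) / odds(10, B0_PLACEHOLDER, b1)

# Two-unit increase (score 8 to score 10): multiplied by the odds ratio squared
ratio_two_units = odds(10, B0_PLACEHOLDER, b1) / odds(8, B0_PLACEHOLDER, b1)

# Percent change in the odds per one-unit increase
percent_change = (ODDS_RATIO - 1) * 100

print(f"slope b1 = {b1:.4f}")
print(f"odds ratio for +1 unit  = {ratio_one_unit:.4f}")
print(f"odds ratio for +2 units = {ratio_two_units:.4f}")
print(f"percent change in odds per unit = {percent_change:.1f}%")
```

To reproduce the predicted probabilities in the session window, replace B0_PLACEHOLDER and b1 with the coefficients from the logistic regression table and call probability(10, b0, b1) or probability(8, b0, b1).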

 

In the output, notice the “Goodness of Fit” tests. We’ll focus on the Pearson Chi-Square value. In goodness-of-fit tests, the null hypothesis is that the data fit the null distribution well (this is one of the few significance tests where the null hypothesis is actually the “research” hypothesis, so not rejecting the null hypothesis is actually a good thing). In the logistic regression case, the null hypothesis is that the predicted values and actual values agree well (fit well). The Pearson Chi-Square test statistic calculates the difference between the observed and predicted values for each observation, squares it, divides each squared difference by an estimate of its variance, and then adds up the results. Big values of this test statistic indicate that the predicted values don’t fit the actual data well. Small values of this test statistic give no indication that the predicted values don’t fit well. Hence, large p-values give us no indication that our model doesn’t fit well. (This doesn’t necessarily say our model does fit well, but at least we have no evidence against this hypothesis.)
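The Pearson chi-square calculation can be sketched in Python for the simplest case of individual 0/1 observations, where the estimated variance of each observation is p(1 - p). The observed values and predicted probabilities below are made up, the function name is hypothetical, and Minitab may group observations before computing its statistic, so this illustrates the idea rather than reproducing Minitab’s output.

```python
def pearson_chi_square(observed, predicted):
    """Sum over observations of (observed - predicted)^2 divided by the
    estimated variance predicted * (1 - predicted)."""
    total = 0.0
    for y, p in zip(observed, predicted):
        variance = p * (1 - p)
        total += (y - p) ** 2 / variance
    return total

# Made-up 0/1 outcomes and model-predicted event probabilities
observed = [1, 0, 1, 0]
good_fit = [0.8, 0.2, 0.6, 0.4]   # predictions lean the right way
poor_fit = [0.2, 0.8, 0.4, 0.6]   # predictions lean the wrong way

chi_good = pearson_chi_square(observed, good_fit)
chi_poor = pearson_chi_square(observed, poor_fit)

print(f"good fit: chi-square = {chi_good:.4f}")
print(f"poor fit: chi-square = {chi_poor:.4f}")
```

Predictions that disagree with the observed outcomes produce the larger statistic, which is why big values count as evidence against the fit.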

 

It’s also interesting to see how particular values impact the chi-square statistic. For the plot we chose, each observation is removed one at a time from the data set and the summary goodness-of-fit chi-square statistic is recalculated. The change (delta) in chi-square provides an idea of how each particular observation affects the chi-square. You can see from the plot that two of the values stand out. Finally, you can create a scatterplot of the event probabilities (estimated by the model) versus the WAIS scores.

 

 

Multiple Regression

Big question: How do we choose a model (based on many predictor variables)? It depends on the research objective. For example, you might want to

·         Control for a large set of explanatory variables (e.g., find the impact of gender on income, but in the presence of all other variables that might also impact income—these define your model)

·         Confirm a model based in theory (e.g., an economic model): the theory provides the variables, and then you can see whether your data agree with the theory (don’t cheat by both creating and confirming your model with the same set of data)

·         Find the set of explanatory variables that best predicts the response variable (computer-intensive selection methods can be used for this)

·         Find a set of explanatory variables that explains/describes the response variable, without any theory to guide you; computer-intensive selection methods can be used for this, but you should be very careful in your final interpretation—this is “data snooping,” where you use the data to both create and confirm your model; you can still get informative results, but do they apply broadly?

o   Collect new data and see if your model still works

o   Initially (randomly) divide your data set into two pieces: one used to create your model and the other to confirm it

 

Description of RealEstateSales.MPJ

Data were collected on 522 homes sold in a Midwestern city. The variables measured were sales price (in dollars), finished square feet, number of bedrooms, number of bathrooms, air conditioning status, garage size (number of cars it will hold), pool status, year built, index for quality of construction (1, 2, or 3, with 1 being the highest quality), lot size (in square feet), and adjacent-highway status.

 

Analysis

Suppose we want to find the model that best explains sales price (in dollars). We could randomly divide our cases into a model-creation set and a confirmation set, but to ease our computations in this lab, we’ll simply “data snoop,” and assume all our conclusions must be confirmed on a new set of data.
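For reference, the random division we are skipping could be sketched like this in Python. The function name, the even 261/261 split, and the seed are arbitrary choices made for illustration; only the total of 522 homes comes from the data description.

```python
import random

def split_cases(n_cases, n_create, seed):
    """Randomly divide row indices 0..n_cases-1 into a model-creation set
    of size n_create and a confirmation set holding the remaining rows."""
    rng = random.Random(seed)        # fixed seed so the split is repeatable
    indices = list(range(n_cases))
    rng.shuffle(indices)
    create = sorted(indices[:n_create])
    confirm = sorted(indices[n_create:])
    return create, confirm

# 522 homes in the data set; the half-and-half split is an arbitrary choice
create_rows, confirm_rows = split_cases(522, 261, seed=445)

print(len(create_rows), "rows for building the model")
print(len(confirm_rows), "rows held back for confirmation")
```

The model would be chosen using only the rows in create_rows and then refit or evaluated on confirm_rows to see whether its conclusions hold up.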

 

A good first step is to create a matrix of scatterplots (Graph>Matrix Plot). Including all variables makes the plots difficult to read, but you can create two different matrix plots (be sure to include the response variable—sales price—in each matrix plot). What do you notice? What relationships do you see?

 

You can use stepwise regression to find a model (Regression>Stepwise); we’ll discuss the details of this in class. What predictor variables does this method suggest? Run a multiple regression with these predictors (include plots of the residuals). What do you notice in the residual plot? Try a natural-log transformation on the sales price, and then re-run the regression. Did that help? Note, though, that the coefficient values are then harder to interpret: each one describes a change in log sales price, which corresponds to a multiplicative (percentage) change in sales price rather than a change in dollars.