Double-click the My Computer icon on the desktop. Then double-click the
campus_share on 'curtis' (U:) drive and then the Class_Share folder. Finally,
double-click the Math folder and then the math_445 folder. This folder
contains, among others, the Minitab files we will use in today’s lab: Cars.MPJ,
RealEstateSales.MPJ, and Senility.MPJ. Copy these three files into your own
account.
Description of Cars.MPJ
This data file
contains information on 93 different cars sold in the year 1993. Can the weight
of a car accurately be used to predict the highway miles per gallon?
Analysis
Since we’re
interested in the relationship between MPG and weight, the first step is to
create a scatterplot (Graph>Scatterplot>Simple). From
this graph, it’s clear that a negative linear relationship exists between these
variables.
There are
multiple ways to perform regression analysis in Minitab. First we’ll use the fitted
line plot function. From the Stat
menu select Regression>Fitted Line
Plot. Choose highway MPG as the response variable and weight as the
predictor variable. The default is linear regression, which is what we want.
Now click on the Graphs button. Note that
you can select a histogram and normal plot of residuals (to check the normality
condition), and you can also select a plot of residuals versus the fitted
values (to check the constant-variance condition). However, this procedure
cannot produce a prediction interval for highway MPG at a specific weight.
(It also gives less regression output in the session window.)
A more flexible option when doing
regression is Stat>Regression>Regression. Select
highway MPG as the response and weight as the predictor (note you can choose
more than one predictor—this procedure allows you to perform multiple
regression). Click on the Graphs
button and you’ll see all the diagnostic graphs from which you can choose. Now
click on the Options button. Note you
can ask for a prediction interval (or confidence interval) for a new
observation. Suppose we want a prediction interval for highway MPG based on a car
that weighs 3000 pounds. Enter the value 3000 in the appropriate box. By
default the 95% level is used (but you can change this if desired).
Notice there is
much output in the session window (more than when you used the fitted line plot
option). We will discuss this as a class. At the end of the output, Minitab
provides both a 95% confidence interval for the average highway MPG for cars
that weigh 3000 pounds and a 95% prediction interval for the highway MPG for
one car that weighs 3000 pounds. Note the prediction interval is clearly
wider. Also, the confidence interval for the average MPG is quite narrow; this
is because 3000 pounds is very close to the average weight of all the cars.
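The difference between the two intervals can be sketched in Python. This is only an illustration, not Minitab's computation: the eight (weight, MPG) pairs are invented, the function names `fit_line` and `intervals` are hypothetical, and a normal critical value stands in for the t critical value Minitab would use (an assumption that makes both intervals slightly too narrow for a sample this small).

```python
import math
from statistics import NormalDist, mean

# Hypothetical (weight in pounds, highway MPG) pairs, invented for illustration.
weights = [2000, 2400, 2700, 3000, 3200, 3500, 3800, 4100]
mpg = [38, 34, 31, 29, 28, 26, 24, 22]

def fit_line(x, y):
    """Least-squares fit plus the quantities needed for the intervals."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # residual standard error
    return intercept, slope, s, x_bar, sxx, n

def intervals(x0, fit, crit):
    """Return (fitted value, CI half-width, PI half-width) at x0."""
    intercept, slope, s, x_bar, sxx, n = fit
    y_hat = intercept + slope * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci_half = crit * s * math.sqrt(leverage)      # mean response at x0
    pi_half = crit * s * math.sqrt(1 + leverage)  # one new observation at x0
    return y_hat, ci_half, pi_half

fit = fit_line(weights, mpg)
# ASSUMPTION: normal quantile used in place of the t critical value.
crit = NormalDist().inv_cdf(0.975)
for x0 in (3000, 4100):
    y_hat, ci_half, pi_half = intervals(x0, fit, crit)
    print(f"x0={x0}: fit={y_hat:.2f}, CI +/- {ci_half:.2f}, PI +/- {pi_half:.2f}")
```

With these made-up numbers, the prediction interval is wider than the confidence interval at both weights, and the confidence interval is narrower at 3000 pounds (near the mean weight of the invented sample) than at 4100 pounds.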
The residual
plot shows a slight fanning-out pattern. This indicates that the
constant-variance condition might not be met. One possible remedy is to
transform the response variable; a natural log transformation often works
to stabilize the variability. Create a new variable that is the natural log of highway
MPG (Calc>Calculator). Then rerun
the regression using log MPG as the response variable. What do you think: did
this transformation help? (Note that you can easily transform predictions of
log MPG back to MPG by exponentiating them.)
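Transforming a prediction back to the MPG scale is a one-line computation. The sketch below uses invented numbers for a prediction interval on the natural-log scale; they are not output from the Cars data.

```python
import math

# Hypothetical prediction interval on the natural-log scale (invented values).
log_lower, log_fit, log_upper = 3.20, 3.35, 3.50

# exp() is increasing, so it maps interval endpoints to interval endpoints.
lower, fit, upper = (math.exp(v) for v in (log_lower, log_fit, log_upper))
print(f"MPG scale: {lower:.1f} to {upper:.1f} (point prediction {fit:.1f})")
```

Note that the back-transformed interval is no longer symmetric about the point prediction, even though it was symmetric on the log scale.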
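Suppose our
response variable (the variable we want to predict) is a binary variable $Y$
that takes the values 0 and 1. Then, $E(Y) = P(Y = 1) = \pi$. If we use the
simple linear model here (i.e., $\pi = \beta_0 + \beta_1 x$), there is one
obvious drawback: the model might predict a value of $\pi$ outside the range (0, 1).
One remedy is to model $\pi$ using a logit model:
$\pi(x) = \dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$
(the logit model assumes an S-shaped curve on the probabilities of occurrence;
this makes sense in some cases, but not others).

Then $1 - \pi(x) = \dfrac{1}{1 + e^{\beta_0 + \beta_1 x}}$.
The odds of occurrence at a particular value of x are defined as
$\dfrac{\pi(x)}{1 - \pi(x)} = e^{\beta_0 + \beta_1 x}$. Hence,
$\ln\left(\dfrac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x$. That
is, we model the log-odds with a linear model. This is a more complicated
model to fit, but we can use Minitab to do the grungy work.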
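A short Python sketch can make the relationship between probabilities, odds, and log-odds concrete. The coefficient values and the function names `probability` and `log_odds` are hypothetical, chosen only for illustration; they are not estimates from any data set in this lab.

```python
import math

def probability(x, b0, b1):
    """Logistic model: P(Y = 1 | x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    eta = b0 + b1 * x
    return 1 / (1 + math.exp(-eta))  # algebraically the same expression

def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

# Hypothetical coefficients, for illustration only.
b0, b1 = 2.0, -0.3

for x in (0, 5, 10, 20):
    p = probability(x, b0, b1)
    # The last two columns agree: the log-odds are linear in x.
    print(x, round(p, 4), round(log_odds(p), 4), round(b0 + b1 * x, 4))
```

Every probability stays strictly between 0 and 1, whereas the straight line b0 + b1*x itself falls outside that range for some of these x values (for example, it equals -4 at x = 20).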
Description of Senility.MPJ
A sample of elderly people was given a psychiatric examination to
determine whether symptoms of senility were present (senility is a decline
or deterioration of physical strength or mental functioning, especially as a
result of old age or disease). One possible predictor variable of senility is
the score on a subtest of the Wechsler Adult Intelligence Scale (WAIS). This data
file includes the senility value (1 = symptoms present, 0 = no symptoms
present) and the WAIS subscale score for the 54 people in the study.
Analysis
First look
descriptively at the data. Create comparative boxplots
and numerical summaries. What do you notice? We want to predict senility based
on the Wechsler Adult Intelligence Scale score. In this
case, the response variable is binary. Hence, we should use logistic
regression. From the Stat menu select
Regression>Binary Logistic Regression.
The response variable is senility and the model simply includes WAIS score
(note you can include multiple predictor variables in the model; that is, you
can do multiple logistic regression). From the Graphs button, select
“Delta chi-square versus probability.” From the Storage button, select “Event
Probability.” You can look at the Options
and Results buttons, but we’ll leave
all the default selections.
Now consider
the output in the session window. First look at the “Test that all slopes are
zero” output. Under the null hypothesis, this G statistic approximately follows
a chi-square distribution with degrees of freedom equal to the number of
predictors. Clearly, this shows a significant
slope value (p-value = 0.001). The logistic regression table gives estimated
coefficients, standard errors, and significance tests (based on the
z-distribution). The WAIS score is clearly a significant predictor (p-value =
0.005). Note it also gives the odds ratio, which is simply
$e^{b_1}$, where $b_1$ is the estimated coefficient of WAIS score. We can give
an interpretation of this value: for each 1-unit increase in WAIS score, the
odds of senility are multiplied by 0.72 (that is, they decrease by about 28%).
Also, for a score of 10 on the WAIS test, the predicted
probability of senility is
$\hat{\pi}(10) = \dfrac{e^{b_0 + 10 b_1}}{1 + e^{b_0 + 10 b_1}}$, where $b_0$
is the estimated constant. Or, for a score of 8 on the WAIS test,
the predicted probability of senility is
$\hat{\pi}(8) = \dfrac{e^{b_0 + 8 b_1}}{1 + e^{b_0 + 8 b_1}}$. Of course, our trust in these
predictions depends on how good our model is.
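To see why the odds ratio reported for WAIS score equals $e^{b_1}$, compare the fitted odds at two scores that are one unit apart. The symbols $b_0$ and $b_1$ (the estimated constant and slope) are notation assumed here for the derivation.

```latex
\[
\frac{\text{odds}(x + 1)}{\text{odds}(x)}
  = \frac{e^{b_0 + b_1 (x + 1)}}{e^{b_0 + b_1 x}}
  = e^{b_1}
\]
```

The ratio does not depend on $x$, so the same multiplicative change in the odds (0.72 in this output) applies to every 1-unit increase in WAIS score. The change in the predicted probability, by contrast, does depend on where along the S-shaped curve the score falls.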
In the output,
notice the “Goodness of Fit” tests. We’ll focus on the Pearson Chi-Square
value. In goodness-of-fit tests, the null hypothesis is that the data fit the
null distribution well (this is one of the few significance tests where the
null hypothesis is actually the “research” hypothesis, so not rejecting the
null hypothesis is actually a good thing). In the logistic regression case, the
null hypothesis is that the predicted values and actual values agree well (fit
well). The Pearson Chi-Square test statistic calculates the difference between the
observed and predicted values for each observation, squares it, divides each
squared difference by an estimate of its variance, and adds up the results. Big values
of this test statistic indicate that the predicted values don’t fit the actual
data well. Small values of this test statistic give no indication that the
predicted values don’t fit well. Hence, large
p-values give us no indication that
our model doesn’t fit well. (This doesn’t necessarily say our model does
fit well, but at least we have no evidence against this hypothesis.)
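The description of the Pearson statistic can be sketched in a few lines of Python. The outcomes and probabilities below are invented, the function name `pearson_contributions` is hypothetical, and the sketch treats each observation separately; Minitab's exact computation may differ (for example, it may group observations that share the same predictor values).

```python
# Hypothetical observed outcomes (1 = event) and model-estimated event
# probabilities, invented for illustration.
observed = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [0.8, 0.3, 0.1, 0.6, 0.4, 0.7, 0.2, 0.5]

def pearson_contributions(y, p):
    """Squared difference for each case, divided by its estimated variance p(1 - p)."""
    return [(yi - pr) ** 2 / (pr * (1 - pr)) for yi, pr in zip(y, p)]

contributions = pearson_contributions(observed, predicted)
chi_square = sum(contributions)

for yi, pr, c in zip(observed, predicted, contributions):
    print(f"observed={yi}  predicted={pr:.1f}  contribution={c:.3f}")
print(f"Pearson chi-square (ungrouped sketch): {chi_square:.3f}")
```

An observation whose outcome disagrees strongly with its predicted probability (say, an event that occurred when the model gave it probability 0.1) contributes a large amount to the total.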
It’s also
interesting to see how particular values impact the chi-square statistic. For
the plot we chose, each observation is removed one at a time from the data set
and the summary goodness-of-fit chi-square statistic is recalculated. The
change (delta) in chi-square provides an idea of how each particular
observation affects the chi-square. You can see from the plot that two of the
values stand out. Finally, you can create a scatterplot
of the event probabilities (estimated by the model) versus the WAIS scores.
Multiple Regression
Big question:
How do we choose a model (based on many predictor variables)? It depends on the
research objective. For example, you might want to
· Control for a large set of explanatory variables (e.g., find the impact of
gender on income, but in the presence of all other variables that might also
impact income; these define your model)
· Confirm a model based in theory (e.g., an economic model): the theory
provides the variables, then you can see if your data agree with the theory
(don’t cheat by both creating and confirming your model with the same set of
data)
· Find the set of explanatory variables that best predicts the response
variable (computer-intensive selection methods can be used for this)
· Find a set of explanatory variables that explains/describes the response
variable, without any theory to guide you; computer-intensive selection methods
can be used for this, but you should be very careful in your final
interpretation. This is “data snooping,” where you use the data to both create
and confirm your model; you can still get informative results, but do they
apply broadly?
o Collect new data and see if your model still works
o Initially (randomly) divide your data set into two pieces: one used to create
your model and the other to confirm it
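The random division into a model-creation set and a confirmation set can be sketched as follows. The function name `split_cases`, the 50/50 split, and the seed are assumptions made for illustration; the case count of 522 matches the real estate data set used later in this lab.

```python
import random

def split_cases(n_cases, fraction_create=0.5, seed=445):
    """Randomly partition case indices into creation and confirmation sets."""
    indices = list(range(n_cases))
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    rng.shuffle(indices)
    cut = int(n_cases * fraction_create)
    return sorted(indices[:cut]), sorted(indices[cut:])

create_rows, confirm_rows = split_cases(522)
print(len(create_rows), len(confirm_rows))  # prints "261 261"
```

The model would then be chosen using only the rows in `create_rows` and checked against the rows in `confirm_rows`, so the same cases are never used for both steps.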
Description of RealEstateSales.MPJ
Data were
collected on 522 homes sold in a Midwestern city. The variables measured were
sales price (in dollars), finished square feet, number of bedrooms, number of
bathrooms, air conditioning status, garage size (number of cars it will hold),
pool status, year built, index for quality of construction (1, 2, or 3, with 1 being
highest quality), lot size (in square feet), and adjacent-highway status.
Analysis
Suppose we want
to find the model that best explains sales price (in dollars). We could
randomly divide our cases into a model-creation set and a confirmation set, but
to ease our computations in this lab, we’ll simply “data snoop,” and assume all
our conclusions must be confirmed on a new set of data.
A good first
step is to create a matrix of scatterplots (Graph>Matrix Plot). Including all
variables makes the plots difficult to read, but you can create two different
matrix plots (be sure to include the response variable—sales price—in each
matrix plot). What do you notice? What relationships do you see?
You can use
stepwise regression to find a model (Stat>Regression>Stepwise); we’ll
discuss the details of this in class. What predictor variables does this method
suggest? Run a multiple regression with these predictors (include plots of the
residuals). What do you notice in the residual plot? Try a natural-log
transformation on the sales price, and then re-run the regression. Did that
help? Note, though, that the interpretation of the coefficient values is no
longer as straightforward or as informative.