MATH 207: Supplemental Material on Regression Analysis (i.e., Additions to Chapter 3)
In most studies and experiments there is a response variable that measures an outcome of the study (this is the variable in which we’re really interested). There are also explanatory variables that explain or cause change in the response variable. [Terminology note: In other contexts, you might see the response variable called the dependent variable, and you might see the explanatory variable called an independent or predictor variable.]
In some cases we want to predict and explain the response variable (y-variable) based on the explanatory variable (x-variable). The method of regression can be used to do this. (Graphical motivation for the method of least squares will be shown in class.)
The Least-Squares Regression Line (of a response variable on an explanatory variable)
· This is the line that minimizes the sum of squared vertical distances of data points from the line (i.e., minimizes the sum of the squared errors).
· The regression line differentiates between the response variable and the explanatory variable, whereas the correlation coefficient does not.
· The regression line can be used as a model to predict and explain the response variable.
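Using calculus (a minimization problem in two variables), we can obtain the equation of the least-squares regression line: $\hat{y} = a + bx$, where the slope, b, is determined by the equation $b = r \dfrac{s_y}{s_x}$ and the y-intercept, a, is determined by the equation $a = \bar{y} - b\bar{x}$.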
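The least-squares line can be sketched in a few lines of code. The example below is only an illustration on made-up data (the data values and function names are hypothetical, not from the course): it computes the slope and intercept once from the direct calculus solution, $b = \sum (x - \bar{x})(y - \bar{y}) / \sum (x - \bar{x})^2$, and once from the summary statistics, $b = r \, s_y / s_x$ and $a = \bar{y} - b\bar{x}$. The two results should agree up to floating-point error.

```python
import math

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """Correlation coefficient r, computed from products of z-scores."""
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def fit_from_summary(xs, ys):
    """Intercept a and slope b from r, the means, and the standard deviations."""
    b = correlation(xs, ys) * sample_sd(ys) / sample_sd(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

def fit_direct(xs, ys):
    """Intercept a and slope b from the direct least-squares solution."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Hypothetical data, made up purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a_summary, b_summary = fit_from_summary(xs, ys)
a_direct, b_direct = fit_direct(xs, ys)
print(f"summary statistics: y-hat = {a_summary:.3f} + {b_summary:.3f} x")
print(f"direct solution:    y-hat = {a_direct:.3f} + {b_direct:.3f} x")
```

In practice statistical software does this work; the point here is only that the two routes lead to the same line.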
Important Notes
· The predicted value from the regression equation is denoted $\hat{y}$ (to differentiate it from the actual response-variable value, y).
· Typically, statistical software calculates the equation of the regression line, but it’s interesting to see that in the two-variable setting, the regression line is determined using summary statistics with which we’re already familiar (correlation, sample mean, and sample standard deviation).
· The slope of the regression line is not equal to the correlation coefficient (unless $s_x = s_y$ or $r = 0$).
· A regression model can be used to i) explain the response variable (for a one-unit increase in the explanatory variable, the response variable is predicted to increase (or decrease) by the slope value, b; when the explanatory variable is 0, the predicted response variable is the intercept value, a), and ii) predict the response variable for a specific value of the explanatory variable (done simply by plugging into the equation of the regression line).
Example 1
Data were collected on 93 cars of various makes and models from the year 1993. We are interested in predicting the highway miles per gallon (MPG) of a car using the weight (in pounds) of the car. After looking at the scatterplot, it seems reasonable to fit a least-squares regression line (i.e., the relationship in the scatterplot appears linear). The equation of the least-squares line (via computer) is $\hat{y} = 51.601 - 0.007x$, where x is the weight and $\hat{y}$ is the predicted highway MPG. A scatterplot (with regression line) and summary statistics for the two variables are included below.

Variable           N    Mean      StDev
Highway MPG        93   29.086    5.332
Weight (in lbs)    93   3072.90   589.90

Correlation Coefficient of Highway MPG and Weight = -0.811
a. Using the given summary statistics, verify the equation of the regression line.
First determine the slope: $b = r \dfrac{s_y}{s_x} = (-0.811)\dfrac{5.332}{589.90} \approx -0.00733 \approx -0.007$. Then determine the y-intercept: $a = \bar{y} - b\bar{x} = 29.086 - (-0.00733)(3072.90) \approx 51.61$ (note this value is off slightly from the computer output; this is due simply to rounding, since the computer keeps many more decimal places). Hence, based on the summary statistics, we’ve verified the given regression line. (Note: In practice, it’s never necessary to verify the computer calculations. This part of the example simply shows the correct use of the regression-line equations.)
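The verification in part (a) can also be sketched in code. The snippet below plugs the summary statistics from the table into $b = r \, s_y / s_x$ and $a = \bar{y} - b\bar{x}$; the variable names are illustrative choices, and because the inputs are themselves rounded, the results should be close to (but not exactly) the computer output.

```python
# Summary statistics reported in the car example.
r = -0.811                                  # correlation of MPG and weight
mean_mpg, sd_mpg = 29.086, 5.332            # response variable (y)
mean_weight, sd_weight = 3072.90, 589.90    # explanatory variable (x)

# Slope: b = r * s_y / s_x
slope = r * sd_mpg / sd_weight

# Intercept: a = y-bar - b * x-bar
intercept = mean_mpg - slope * mean_weight

print(f"slope     = {slope:.5f}")
print(f"intercept = {intercept:.3f}")
```

The slope rounds to -0.007 and the intercept lands near 51.6, in line with the software-reported equation.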
b. Explanation
i. In words, carefully interpret the value of the slope of the regression line.
As the weight of a car increases by one pound, the predicted MPG decreases by 0.007. Or, perhaps more informatively, as the weight of a car increases by 100 pounds, the predicted MPG decreases by 0.7.
ii. In words, carefully interpret the value of the y-intercept of the regression line.
In this situation, the y-intercept, 51.601 MPG, has no meaningful interpretation, as no car weighs 0 pounds (or anywhere near 0 pounds).
c. Prediction
i. What is the predicted MPG for a car that weighs 3,000 pounds?
To answer this question, we simply plug into the equation of the regression line: $\hat{y} = 51.601 - 0.007(3000) = 30.601$. For a 3,000-pound car, we predict 30.601 MPG.
ii. What MPG would you predict for a 5,000-pound car?
Now we must be careful. The value of 5,000 pounds is far outside the range of the collected data (and our regression line is based solely on our collected data). Hence, we should not use the regression line to make a prediction, as it might be very inaccurate. (Note: This is one of many times in statistics where we could calculate a number, but, given the context of the problem, we should not.)
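The two prediction questions can be sketched as a small function that refuses to extrapolate. This is only an illustration: the notes do not list the smallest and largest car weights in the data, so the bounds below are placeholder assumptions, and the function and constant names are hypothetical.

```python
# Fitted line from the car example: predicted MPG = 51.601 - 0.007 * weight.
INTERCEPT = 51.601
SLOPE = -0.007

# ASSUMPTION: placeholder bounds for the observed weights. The notes do not
# give the actual minimum and maximum, so replace these with the real values.
ASSUMED_MIN_WEIGHT = 1500.0
ASSUMED_MAX_WEIGHT = 4500.0

def predict_mpg(weight_lbs, x_min=ASSUMED_MIN_WEIGHT, x_max=ASSUMED_MAX_WEIGHT):
    """Predict highway MPG, refusing to extrapolate outside [x_min, x_max]."""
    if not x_min <= weight_lbs <= x_max:
        raise ValueError(
            f"{weight_lbs} lbs is outside the observed range "
            f"[{x_min}, {x_max}]; a prediction would be extrapolation"
        )
    return INTERCEPT + SLOPE * weight_lbs

# Part (i): a 3,000-pound car is inside the range, so we plug in.
print(round(predict_mpg(3000), 3))

# Part (ii): a 5,000-pound car is outside the range, so no prediction is made.
try:
    predict_mpg(5000)
except ValueError as err:
    print("No prediction:", err)
```

Raising an error rather than returning a number mirrors the advice in part (ii): the arithmetic is possible, but the result should not be trusted.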
Regression Diagnostics (i.e., how well does our regression line predict and explain?)
There are many regression diagnostics, but we’ll keep things simple and only discuss two:
The Square of the Correlation Coefficient, $r^2$
· In regression analysis, the numerical value of $r^2$ tells us the proportion of the variation in the response variable that is explained by the regression line. That is, there is natural variation in the response variable, but $r^2$ tells us the proportion of that variation that is explained by the linear relationship with the explanatory variable. (In class we will briefly discuss the mathematics behind this.)
· Recall the correlation coefficient, r, is a summary statistic that can accompany a scatterplot (during a first look at the data). When doing regression analysis, though, $r^2$ is a more informative statistic.
· The closer $r^2$ is to 1 (or 100%), the better the fit of the model and the stronger the predictive power of the regression line (i.e., the more we trust our predictions). Note there is no “gold standard” for a value of $r^2$, but simply put: the higher, the better (different disciplines might have different expectations).
d. (Car Example Continued) Determine the value of $r^2$ for this regression, and, in words, carefully interpret the value.
For the car data, $r^2 = (-0.811)^2 = 0.6577$. Hence, the regression line explains 65.77% of the variation in MPG of these cars. This is not very high. Perhaps other (appropriate) variables could be added to the model in an effort to explain more of the variation.
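To make "proportion of variation explained" concrete, the sketch below first squares the car correlation, then uses a small made-up data set (hypothetical values and function names, not from the course) to check that $r^2$ matches $1 - \text{SSE}/\text{SST}$, where SSE is the sum of squared residuals and SST is the total squared variation of y about its mean. This is one common way to express the idea, offered as an illustration rather than the derivation used in class.

```python
import math

# Car example: r^2 is simply the square of the reported correlation.
r_cars = -0.811
r_squared_cars = r_cars ** 2
print(f"cars: r^2 = {r_squared_cars:.4f}")

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def correlation(xs, ys):
    """Correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, made up purely for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [3.0, 4.5, 4.0, 6.5, 7.0, 6.0]

a, b = least_squares(xs, ys)
y_bar = sum(ys) / len(ys)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
sst = sum((y - y_bar) ** 2 for y in ys)                    # total variation
explained = 1 - sse / sst
r = correlation(xs, ys)

print(f"toy data: r^2 = {r ** 2:.4f}, 1 - SSE/SST = {explained:.4f}")
```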
Analysis of the Residuals
· A residual is simply the predicted y-value subtracted from the actual y-value. Using appropriate notation: residual $= y - \hat{y}$. (Be careful to do the subtraction in the appropriate way. For example, a positive residual should indicate a point that falls above the regression line.)
e. (Car Example Continued) One of the cars has a weight of 2,530 pounds and gets 30 MPG. What is the residual associated with this car?
For this car, the regression line predicts $\hat{y} = 51.601 - 0.007(2530) = 33.891$ MPG. So the residual for this particular car is $30 - 33.891 = -3.891$ MPG. (Note: You might wonder about the magnitude of this number; that is, is it big? In our cursory look at regression, we won’t address issues of magnitude, but realize there are standardized residuals, and all sorts of other regression diagnostics.)
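The residual calculation in part (e) can be sketched as follows; the function names are illustrative choices. Note the order of subtraction (actual minus predicted), so that a negative residual means the point falls below the regression line.

```python
def predicted_mpg(weight_lbs):
    """Predicted highway MPG from the fitted line in the car example."""
    return 51.601 - 0.007 * weight_lbs

def residual(actual, predicted):
    """Residual = actual y-value minus predicted y-value."""
    return actual - predicted

# The car from part (e): 2,530 pounds, 30 MPG.
y_hat = predicted_mpg(2530)
res = residual(30, y_hat)

print(f"predicted = {y_hat:.3f} MPG, residual = {res:.3f} MPG")
```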
· Examining the residuals (all together) helps us assess how well the regression line describes the overall relationship in the data. The most basic residual plot is a scatterplot of residuals (y-axis) versus the explanatory variable (x-axis). If the regression line captures the overall relationship in the data, there should be no pattern in the residual plot (just a random scatter/cloud of points around the 0-line). That is, if the relationship in the data is modeled well, then all that’s left over is random variation. A pattern in the residuals indicates that the straight-line model is somehow inadequate.
f. (Car Example Continued) The residual plot from this regression is shown below. What does this plot tell you about the adequacy of the regression model?

There is a slight fanning out of the residuals (for cars in the 1,500-to-2,500-pound range); otherwise the plot shows a “random scatter” of residuals around the 0-line. That is, a straight line seems a (mostly) appropriate model for the MPG-and-Weight data.
· Another example: For a hypothetical class, the exam scores and study times (in hours) are shown in the scatterplot below. Because the relationship seems linear, a regression line is fit. The residual plot from the regression is also shown below. Does the residual plot show a random scatter of points? A pattern? What does this tell you about the regression model?


The residual plot definitely shows a fanning-out pattern (more variation in the residuals for smaller study times). When residuals fan out, typically either or both variables can be transformed (a square-root or logarithm transformation often works well) and then the regression is re-run on the transformed variables. (Clearly, this work would need to be done with computer software.)
· One more example: For a sample of girls, the age and average height are recorded. These variables are shown in the scatterplot below. The linear relationship is very strong, with a correlation coefficient of 0.994. A regression line is fit to the data (predicted height = 27.62 + 2.58 × age). The residual plot from the regression is also shown below. What does the residual plot tell you about the regression model?


In this case the residual plot shows a clear pattern of curvature. This indicates a quadratic curve (rather than a straight line) is a better model for the relationship in these data. (Note this subtlety might go unnoticed in the original scatterplot, but is quite clear after examination of the residuals.)
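The way curvature hides in a scatterplot but shows up in the residuals can be sketched with made-up data. The example below does not use the girls' height data (which is not reproduced in these notes); it uses hypothetical points lying exactly on the curve y = x², fits a straight line, and prints the residuals. The correlation is high, yet the residuals run positive, then negative, then positive again, which is the kind of curved pattern described above.

```python
import math

# Hypothetical data on an exact quadratic curve (NOT the height data).
xs = [float(x) for x in range(11)]   # 0, 1, ..., 10
ys = [x ** 2 for x in xs]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

# Least-squares line and correlation coefficient.
slope = sxy / sxx
intercept = my - slope * mx
r = sxy / math.sqrt(sxx * syy)

# Residual = actual y minus predicted y, one per data point.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(f"line: y-hat = {intercept:.2f} + {slope:.2f} x, r = {r:.3f}")
for x, res in zip(xs, residuals):
    sign = "+" if res > 0 else "-"
    print(f"x = {x:4.1f}  residual = {res:7.2f}  ({sign})")
```

A plot of these residuals against x would trace a U shape around the 0-line instead of a random cloud.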
Other Notes on Regression
· As is probably obvious, outliers can impact the least-squares regression line. Some outliers (particularly outliers in the x-direction) can be quite influential and might markedly change the equation of the regression line (and the $r^2$ value). In these cases, the regression analysis can be done both with and without the outlier(s), and both sets of results can be reported.
· We have barely scratched the surface on regression models and diagnostics. For example, in most situations it seems natural to use more than one explanatory variable to predict the response variable (especially for complex systems). This idea (multiple regression), along with further regression diagnostics, is explored in Econometrics (and in my Applied Statistical Methods course).
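The note on outliers above can be illustrated with a short sketch. The data are made up (hypothetical values and function names): five points that lie close to a line, plus one point far out in the x-direction. Fitting the line with and without that point shows how strongly a single influential observation can change both the slope and $r^2$.

```python
import math

def fit(xs, ys):
    """Return (intercept, slope, r_squared) for the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    return my - slope * mx, slope, r ** 2

# Hypothetical data, made up purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# One hypothetical outlier, far from the others in the x-direction.
outlier_x, outlier_y = 20.0, 5.0

a_without, b_without, r2_without = fit(xs, ys)
a_with, b_with, r2_with = fit(xs + [outlier_x], ys + [outlier_y])

print(f"without outlier: slope = {b_without:.3f}, r^2 = {r2_without:.3f}")
print(f"with outlier:    slope = {b_with:.3f}, r^2 = {r2_with:.3f}")
```

Reporting both fits, as the notes suggest, lets the reader see how much the conclusion depends on the one unusual point.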
Last Bit on Regression Diagnostics
It’s important to realize the $r^2$ value and the residual plot tell us different things about a regression line. The $r^2$ value measures the proportion of the variation in the response variable that is explained by the regression line. If this value is low, then there is a lot of variability left unexplained, and we cannot trust our predictions. The residual plot tells us if a line is actually the best way to describe the relationship in the data (a random scatter of residuals means the regression line is a good description; a pattern in the residuals means there is a better description than a line or a transformation should be used).
Example A: High $r^2$ value and no pattern in the residuals
(A line is the best way to describe the relationship in the data and almost all of the variation in the response variable is explained. Yeah!)


Example B: High $r^2$ value, yet a pattern in the residuals
(Most of the variation in the response variable is explained, but a line is not the best way to characterize the relationship; a curve would fit the data better, and then $r^2$ would increase, too.)


Example C: No pattern in the residuals, but a low $r^2$ value
(A line is the best way to describe the relationship, but a lot of the variability in the response variable is left unexplained; perhaps another variable should be added to the regression.)

