MATH 207: Supplemental Material on Regression Analysis (i.e., Additions to Chapter 3)
In most studies/experiments there is a response variable (or variables) that measures an outcome of the study (this is the variable in which we’re really interested). Furthermore, there are explanatory variables that explain or cause change in the response variable. [Terminology note: In other contexts, you might see the response variable called the dependent variable, and you might see the explanatory variable called an independent or predictor variable.]
In some cases we want to predict and explain the response variable (y-variable) based on the explanatory variable (x-variable). The method of regression can be used to do this. (Graphical motivation for the method-of-least-squares will be shown in class.)
The Least-Squares Regression Line (of a response variable on an explanatory variable)
· This is the line that minimizes the sum of squared vertical distances of data points from the line (i.e., minimizes the sum of the squared errors).
· The regression line differentiates between the response variable and the explanatory variable, whereas the correlation coefficient does not.
· The regression line can be used as a model to predict and explain the response variable.
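Using calculus (a minimization problem in two variables), we can obtain the equation of the least-squares regression line: $\hat{y} = a + bx$, where the slope, b, is determined by the equation $b = r\,\dfrac{s_y}{s_x}$ and the y-intercept, a, is determined by the equation $a = \bar{y} - b\bar{x}$.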
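As a minimal sketch (the data values and the helper names `correlation` and `least_squares_line` are hypothetical, chosen only for illustration), the slope $b = r\,s_y/s_x$ and intercept $a = \bar{y} - b\bar{x}$ can be computed from summary statistics and compared with the direct least-squares solution $b = S_{xy}/S_{xx}$:

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation coefficient r of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * stdev(xs) * stdev(ys))

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x, using only r, means, and SDs."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

# Hypothetical data, for illustration only (not from the handout)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)

# Direct least-squares solution for comparison
x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b_direct = s_xy / s_xx
a_direct = y_bar - b_direct * x_bar

print(f"summary-statistic form: a = {a:.4f}, b = {b:.4f}")
print(f"direct least squares:   a = {a_direct:.4f}, b = {b_direct:.4f}")
```

Both routes give the same line (up to floating-point rounding), which is why the summary statistics alone are enough in the two-variable setting.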
Important Notes
· The predicted value from the regression equation is denoted $\hat{y}$ (to differentiate it from the actual response-variable value, y).
· Typically, statistical software calculates the equation of the regression line, but it’s interesting to see that in the two-variable setting, the regression line is determined using summary statistics with which we’re already familiar (correlation, sample mean, and sample standard deviation).
· The slope of the regression line is not equal to the correlation coefficient (unless $s_x = s_y$ or $r = 0$).
· A regression model can be used to i) explain the response variable (for a one-unit increase in the explanatory variable, the response variable is predicted to increase, or decrease, by the slope value, b; when the explanatory variable is 0, the predicted response variable is the intercept value, a), and ii) predict the response variable for a specific value of the explanatory variable (basically, plug into the equation of the regression line).
Example 1
Data were collected on 93 cars of various makes and models from the year 1993. We are interested in predicting the highway miles per gallon (MPG) of a car using the weight (in pounds) of the car. After looking at the scatterplot, it seems reasonable to fit a least-squares regression line (i.e., the relationship in the scatterplot appears linear). The equation of the least-squares line (via computer) is $\widehat{\text{MPG}} = 51.601 - 0.007\,(\text{Weight})$. A scatterplot (with regression line) and summary statistics for the two variables are included below.

| Variable | N | Mean | StDev |
|---|---|---|---|
| Highway MPG | 93 | 29.086 | 5.332 |
| Weight (in lbs) | 93 | 3072.90 | 589.90 |

Correlation Coefficient of Highway MPG and Weight = -0.811
a. Using the given summary statistics, verify the equation of the regression line.
First determine the slope: $b = r\,\dfrac{s_y}{s_x} = (-0.811)\,\dfrac{5.332}{589.90} \approx -0.00733 \approx -0.007$. Then determine the y-intercept: $a = \bar{y} - b\bar{x} = 29.086 - (-0.00733)(3072.90) \approx 51.61$ (note this value is off slightly from the computer output; this is due simply to rounding, as the computer keeps many more decimal places). Hence, based on the summary statistics, we’ve verified the given regression line. (Note: In practice, it’s never necessary to verify the computer calculations. This part of the example simply shows the correct use of the regression-line equations.)
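The arithmetic in part a can be checked with a few lines of code. This is only a sketch using the summary statistics from the table (the variable names are illustrative); it applies $b = r\,s_y/s_x$ and $a = \bar{y} - b\bar{x}$ without rounding the slope first:

```python
# Summary statistics taken from the handout's table
r = -0.811                           # correlation of Highway MPG and Weight
mean_mpg, sd_mpg = 29.086, 5.332     # response variable (y)
mean_wt, sd_wt = 3072.90, 589.90     # explanatory variable (x)

b = r * sd_mpg / sd_wt               # slope
a = mean_mpg - b * mean_wt           # y-intercept

print(f"slope b     = {b:.5f}")      # about -0.00733
print(f"intercept a = {a:.2f}")      # about 51.61
```

The small gap between this intercept and the computer's 51.601 comes from the rounded summary statistics, not from a different method.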
b. Explanation
i. In words, carefully interpret the value of the slope of the regression line.
As the weight of a car increases by one pound, the predicted MPG decreases by 0.007. Or, perhaps more informatively, as the weight of a car increases by 100 pounds, the predicted MPG decreases by 0.7.
ii. In words, carefully interpret the value of the y-intercept of the regression line.
In this situation, the y-intercept, 51.601 MPG, has no meaningful interpretation, as no car weighs 0 pounds (or anywhere near 0 pounds).
c. Prediction
i. What is the predicted MPG for a car that weighs 3,000 pounds?
To answer this question, we simply plug into the equation of the regression line: $\hat{y} = 51.601 - 0.007(3000) = 30.601$. For a 3,000-pound car, we predict 30.601 MPG.
ii. What MPG would you predict for a 5,000-pound car?
Now we must be careful. The value of 5,000 pounds is far outside the range of the collected data (and our regression line is based solely on our collected data). Hence, we should not use the regression line to make a prediction, as it might be very inaccurate. This is one of many times in statistics where we could calculate a number, but, based on the context of the problem, we should not do the calculation.
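One way to build this caution into a calculation is to refuse to predict outside the observed x-range. The sketch below is an illustration only: the function name `predict_mpg` is hypothetical, and the weight range used here (1,500 to 4,500 pounds) is an ASSUMED placeholder, since the handout does not list the minimum and maximum weights in the data.

```python
def predict_mpg(weight_lbs, min_weight, max_weight):
    """Predicted highway MPG from the fitted line 51.601 - 0.007 * weight.

    min_weight and max_weight should be the smallest and largest weights
    in the collected data; predictions outside that range are refused.
    """
    if not (min_weight <= weight_lbs <= max_weight):
        raise ValueError(
            f"{weight_lbs} lb is outside the observed range "
            f"[{min_weight}, {max_weight}]; refusing to extrapolate"
        )
    return 51.601 - 0.007 * weight_lbs

# ASSUMED range for illustration only (not given in the handout)
ASSUMED_MIN_WEIGHT, ASSUMED_MAX_WEIGHT = 1500, 4500

print(round(predict_mpg(3000, ASSUMED_MIN_WEIGHT, ASSUMED_MAX_WEIGHT), 3))

try:
    predict_mpg(5000, ASSUMED_MIN_WEIGHT, ASSUMED_MAX_WEIGHT)
except ValueError as err:
    print("No prediction:", err)
```

With real data, the two bounds would be replaced by the actual minimum and maximum of the explanatory variable.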
Regression Diagnostics (i.e., how well does our regression line predict and explain?)
There are many regression diagnostics, but we’ll keep things simple and only discuss two:
The Square of the Correlation Coefficient, $r^2$
· In regression analysis, the numerical value of $r^2$ tells us the proportion of the variation in the response variable that is explained by the regression line. That is, there is natural variation in the response variable, but $r^2$ tells us the proportion of that variation that is explained by the linear relationship with the explanatory variable. (In class we will briefly discuss the mathematics behind this.)
· Recall the correlation coefficient, r, is a summary statistic that can accompany a scatterplot (during a first look at the data). When doing regression analysis, though, $r^2$ is a more informative statistic.
· The closer $r^2$ is to 1 (or 100%), the better the fit of the model and the stronger the predictive power of the regression line (i.e., the more we trust our predictions). Note there is no “gold standard” for a value of $r^2$, but simply put: the higher, the better (different disciplines might have different expectations).
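The "proportion of variation explained" reading of $r^2$ can be checked numerically. The sketch below uses hypothetical data (not from the handout) and compares two quantities: one minus the ratio of the squared-error sum around the line (SSE) to the total squared variation of y around its mean (SST), and the square of the correlation coefficient.

```python
from statistics import mean

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0, 4.5, 4.0, 7.5, 7.0, 9.5]

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)   # total variation in y (SST)

# Least-squares line y-hat = a + b*x
b = s_xy / s_xx
a = y_bar - b * x_bar

# Variation left over after fitting the line (SSE)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

explained = 1 - sse / s_yy                 # proportion of variation explained
r_squared = s_xy ** 2 / (s_xx * s_yy)      # square of the correlation

print(f"1 - SSE/SST = {explained:.4f}")
print(f"r^2         = {r_squared:.4f}")
```

For a least-squares line with one explanatory variable, the two printed values agree up to floating-point rounding.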
d. (Car Example Continued) Determine the value of $r^2$ for this regression, and, in words, carefully interpret the value.
For the car data, $r^2 = (-0.811)^2 = 0.6577$. Hence, the regression line explains 65.77% of the variation in MPG of these cars. This is not very high. Perhaps other (appropriate) variables could be added to the model in an effort to explain more of the variation. Can you think of other variables to add?
Analysis of the Residuals
· A residual is simply the predicted y-value subtracted from the actual y-value. Using appropriate notation: residual $= y - \hat{y}$. (Be careful to do the subtraction in the appropriate way. For example, a positive residual should indicate a point that falls above the regression line.)
e. (Car Example Continued) One of the cars has a weight of 2,530 pounds and gets 30 MPG. What is the residual associated with this car?
For this car, the regression line predicts $\hat{y} = 51.601 - 0.007(2530) = 33.891$ MPG. So the residual for this particular car is (30 – 33.891) = -3.891 MPG. (Note: You might wonder about the magnitude of this number; that is, is it big? In our cursory look at regression, we won’t address issues of magnitude, but realize there are standardized residuals, and all sorts of other regression diagnostics.)
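The residual calculation for this car can be written out as a short sketch (the function name `residual` is illustrative). The order of subtraction is actual minus predicted, so a negative result means the point falls below the regression line.

```python
def residual(actual_y, predicted_y):
    """Residual = actual y minus predicted y (positive means above the line)."""
    return actual_y - predicted_y

# Fitted line from the handout: predicted MPG = 51.601 - 0.007 * weight
predicted = 51.601 - 0.007 * 2530
res = residual(30, predicted)

print(round(predicted, 3), round(res, 3))  # 33.891 -3.891
```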
· Examining the residuals (all together) helps us assess how well the regression line describes the overall relationship in the data. The most basic residual plot is a scatterplot of residuals (y-axis) versus the explanatory variable (x-axis). If the regression line captures the overall relationship in the data, there should be no pattern in the residual plot (just a random scatter/cloud of points around the 0-line). That is, if the relationship in the data is modeled well, then all that’s left over is random variation. A pattern in the residuals indicates that the straight-line model is somehow inadequate.
f. (Car Example Continued) The residual plot from this regression is shown below. What does this plot tell you about the adequacy of the regression model?

There is a slight fanning out of the residuals (for cars in the 1500-2500-pound range); otherwise the plot shows a “random scatter” of residuals around the 0-line. That is, a straight line seems a (mostly) appropriate model for the MPG-and-Weight data.
· Another example: For a hypothetical class, the exam scores and study time (in hours) are shown in the scatterplot below. Because the relationship seems linear, a regression line is fit. The residual plot from the regression is also shown below. Does the residual plot show a random scatter of points? A pattern? What does this tell you about the regression model?


The residual plot definitely shows a fanning-out pattern (more variation in the residuals for smaller study times). When residuals fan out, typically either or both variables can be transformed (logarithm or square-root transformations often work well) and then the regression is re-run on the transformed variables. (Clearly, this work needs to be done with computer software.)
· One more example: For a sample of girls, the age and average height are recorded. These variables are shown in the scatterplot below. The linear relationship is very strong, with a correlation coefficient of 0.994. A regression line is fit to the data (predicted height = 27.62 + 2.58 age). The residual plot from the regression is also shown below. What does the residual plot tell you about the regression model?


In this case the residual plot shows a clear pattern of curvature. This indicates a quadratic curve (rather than a straight line) is a better model for the relationship in these data. (Note this subtlety might go unnoticed in the original scatterplot, but is quite clear after examination of the residuals.)
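The same effect can be reproduced with a small sketch. The data below are hypothetical (the points lie exactly on $y = x^2$; they are not the height data): the correlation is high, yet the residuals from the fitted line follow a clear U-shaped pattern, which signals curvature.

```python
from statistics import mean

# Hypothetical curved data: y = x^2 (for illustration only)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [x ** 2 for x in xs]

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

b = s_xy / s_xx                 # slope of the least-squares line
a = y_bar - b * x_bar           # y-intercept
r = s_xy / (s_xx * s_yy) ** 0.5

# Residual = actual y minus predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"r = {r:.3f}")
print("residuals:", residuals)
```

The residuals are positive at both ends and negative in the middle, so a straight line is not the best description even though r is close to 1.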
Other Notes on Regression
· As is probably obvious, outliers can impact the least-squares regression line. Some outliers (particularly outliers in the x-direction) can be quite influential and might markedly change the equation of the regression line (and the $r^2$ value). In these cases, the regression analysis can be done both with and without the outlier(s), and both sets of results can be reported.
· We have barely scratched the surface on regression models and diagnostics. For example, in most situations it seems natural to use more than one explanatory variable to predict the response variable (especially for complex systems). This idea (multiple regression and model selection), along with further regression diagnostics, is explored in other courses. If you’re interested, take Math 217 (Applied Statistical Methods) or Econ 380 (Econometrics).
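To illustrate the note above about influential outliers, here is a minimal sketch that fits the line with and without a single outlier in the x-direction. The data and the helper name `fit_line` are hypothetical, chosen so the effect is easy to see.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# Hypothetical data lying exactly on y = 2x, plus one outlier far out in x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
outlier_x, outlier_y = 15.0, 10.0

a_without, b_without = fit_line(xs, ys)
a_with, b_with = fit_line(xs + [outlier_x], ys + [outlier_y])

print(f"without outlier: y-hat = {a_without:.3f} + {b_without:.3f} x")
print(f"with outlier:    y-hat = {a_with:.3f} + {b_with:.3f} x")
```

In this made-up example one point pulls the slope from 2 down to below 0.5, which is why reporting both fits is a reasonable practice.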
Last Bit on Regression Diagnostics
It’s important to realize the $r^2$ value and the residual plot tell us different things about a regression line. The $r^2$ value measures the proportion of the variation in the response variable that is explained by the regression line. If this value is low, then there is a lot of variability left unexplained, and we cannot trust our predictions. The residual plot tells us if a line is actually the best way to describe the relationship in the data (a random scatter of residuals means the regression line is a good description; a pattern in the residuals means there is a better description than a line or a transformation should be used).
Example A (generic, no-context variables): High $r^2$ value and no pattern in the residuals
(A line is the best way to describe the relationship in the data and almost all of the variation in the response variable is explained. Yeah! We can feel confident using this model to explain and predict the response variable.)


Example B (generic, no-context variables): High $r^2$ value, yet a pattern in the residuals
(Most of the variation in the response variable is explained, but a line is not the best way to characterize the relationship; a curve would fit the data better, and then $r^2$ would increase, too.)


Example C (generic, no-context variables): No pattern in the residuals, but a low $r^2$ value
(A line is the best way to describe the relationship, but a lot of the variability in the response variable is left unexplained; perhaps another variable should be added to the regression.)

