MATH 207—Supplemental Material on Regression Analysis (i.e., Additions to Chapter 3)

 

In most studies and experiments there is a response variable (sometimes more than one) that measures an outcome of the study (this is the variable in which we’re really interested). Furthermore, there are explanatory variables that explain or cause change in the response variable. [Terminology note: In other contexts, you might see the response variable called the dependent variable, and you might see the explanatory variable called an independent or predictor variable.]

 

In some cases we want to predict and explain the response variable (y-variable) based on the explanatory variable (x-variable). The method of regression can be used to do this. (Graphical motivation for the method-of-least-squares will be shown in class.)

 

The Least-Squares Regression Line (of a response variable on an explanatory variable)

·         This is the line that minimizes the sum of squared vertical distances of data points from the line (i.e., minimizes the sum of the squared errors).

·         The regression line differentiates between the response variable and the explanatory variable, whereas the correlation coefficient does not.

·         The regression line can be used as a model to predict and explain the response variable.

 

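Using calculus (a minimization problem in two variables), we can obtain the equation of the least-squares regression line: $\hat{y} = a + bx$, where the slope, b, is determined by the equation $b = r\,(s_y / s_x)$ and the y-intercept, a, is determined by the equation $a = \bar{y} - b\,\bar{x}$.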
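For readers who want to see the calculus, here is a brief sketch of the minimization. The name $S(a,b)$ for the sum of squared errors is our own label; the data points are $(x_i, y_i)$ for $i = 1, \dots, n$. Setting both partial derivatives to zero and solving gives the intercept and slope in terms of the familiar summary statistics.

```latex
\begin{align*}
S(a,b) &= \sum_{i=1}^{n} (y_i - a - b x_i)^2 \\
\frac{\partial S}{\partial a} &= -2\sum_{i=1}^{n} (y_i - a - b x_i) = 0
  \quad\Longrightarrow\quad a = \bar{y} - b\,\bar{x} \\
\frac{\partial S}{\partial b} &= -2\sum_{i=1}^{n} x_i\,(y_i - a - b x_i) = 0
  \quad\Longrightarrow\quad
  b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
    = r\,\frac{s_y}{s_x}
\end{align*}
```

The last equality follows from the definitions of r, $s_x$, and $s_y$: the factors of $n - 1$ cancel.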

 

Important Notes

·         The predicted value from the regression equation is denoted $\hat{y}$ (to differentiate it from the actual response-variable value, y).

 

·         Typically, statistical software calculates the equation of the regression line, but it’s interesting to see that in the two-variable setting, the regression line is determined using summary statistics with which we’re already familiar (correlation, sample mean, and sample standard deviation).

 

·         The slope of the regression line is not equal to the correlation coefficient (unless $s_x = s_y$ or $r = 0$).

 

·         A regression model can be used to i) explain the response variable (for a one-unit increase in the explanatory variable, the response variable is predicted to increase (or decrease) by the slope value, b; when the explanatory variable is 0, the predicted response variable is the intercept value, a), and ii) predict the response variable for a specific value of the explanatory variable (done simply by plugging into the equation of the regression line).

 

Example 1

Data were collected on 93 cars of various makes and models from the year 1993. We are interested in predicting the highway miles per gallon of a car using the weight (in pounds) of a car. After looking at the scatterplot, it seems reasonable to fit a least-squares regression line (i.e., the relationship in the scatterplot appears linear). The equation of the least-squares line (via computer) is $\hat{y} = 51.601 - 0.007x$, where x is the weight in pounds and $\hat{y}$ is the predicted highway MPG. A scatterplot (with regression line) and summary statistics for the two variables are included below.

Variable          N      Mean     StDev 

Highway MPG      93    29.086     5.332  

Weight (in lbs)  93   3072.90    589.90  

 

Correlation Coefficient of Highway MPG and Weight = -0.811

 

a.      Using the given summary statistics, verify the equation of the regression line.

First determine the slope: $b = r\,(s_y / s_x) = (-0.811)(5.332 / 589.90) \approx -0.00733$, which rounds to $-0.007$. Then determine the y-intercept: $a = \bar{y} - b\,\bar{x} = 29.086 - (-0.00733)(3072.90) \approx 51.61$ (note this value is off slightly from the computer output; this is due simply to rounding, since the computer keeps many more decimal places). Hence, based on the summary statistics, we’ve verified the given regression line. (Note: In practice, it’s never necessary to verify the computer calculations. This part of the example simply shows the correct use of the regression-line equations.)
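The arithmetic in part (a) can also be checked with a few lines of Python. This is only a sketch: the function name `line_from_summary` and the variable names are our own, and the numbers are the summary statistics given in the example.

```python
# Summary statistics reported in Example 1 (93 cars from 1993).
mean_mpg, sd_mpg = 29.086, 5.332          # response: highway MPG
mean_weight, sd_weight = 3072.90, 589.90  # explanatory: weight in pounds
r = -0.811                                # correlation coefficient


def line_from_summary(r, mean_x, sd_x, mean_y, sd_y):
    """Return (intercept a, slope b) of the least-squares line y-hat = a + b*x."""
    b = r * sd_y / sd_x        # slope: b = r * (s_y / s_x)
    a = mean_y - b * mean_x    # intercept: a = y-bar - b * x-bar
    return a, b


a, b = line_from_summary(r, mean_weight, sd_weight, mean_mpg, sd_mpg)
print(f"slope b     = {b:.5f}")
print(f"intercept a = {a:.3f}")
```

Because the slope is not rounded before the intercept is computed, the intercept lands close to (but not exactly on) the 51.601 reported by the software, which works from the raw data rather than rounded summary statistics.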

 

b.      Explanation

                                i.            In words, carefully interpret the value of the slope of the regression line.

As the weight of a car increases by one pound, the predicted MPG decreases by 0.007. Or, perhaps more informatively, as the weight of a car increases by 100 pounds, the predicted MPG decreases by 0.7.

 

                              ii.            In words, carefully interpret the value of the y-intercept of the regression line.

In this situation, the y-intercept, 51.601 MPG, has no meaningful interpretation, as no car weighs 0 pounds (or anywhere near 0 pounds).

 

c.      Prediction

                                i.            What is the predicted MPG for a car that weighs 3,000 pounds?

To answer this question, we simply plug into the equation of the regression line: $\hat{y} = 51.601 - 0.007(3000) = 30.601$. For a 3,000-pound car, we predict 30.601 MPG.

 

                              ii.            What MPG would you predict for a 5,000-pound car?

Now we must be careful. The value of 5,000 pounds is far outside the range of the collected data (and our regression line is based solely on our collected data). Hence, we should not use the regression line to make a prediction, as it might be very inaccurate. (Note: This is one of many times in statistics where we could calculate a number, but we should not do the calculation based on the context of the problem.)
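One way to build this caution into a calculation is to have the prediction step refuse to extrapolate. The sketch below uses the regression line from this example; the weight limits `OBSERVED_MIN_LB` and `OBSERVED_MAX_LB` are hypothetical placeholders (the handout does not list the actual minimum and maximum weights), and the function name is our own.

```python
INTERCEPT = 51.601   # a, from the computer output in Example 1
SLOPE = -0.007       # b, change in predicted MPG per pound

# HYPOTHETICAL range of observed weights (placeholders, not from the data set).
OBSERVED_MIN_LB = 1600.0
OBSERVED_MAX_LB = 4200.0


def predict_mpg(weight_lb, lo=OBSERVED_MIN_LB, hi=OBSERVED_MAX_LB):
    """Predict highway MPG, refusing to extrapolate outside [lo, hi]."""
    if not lo <= weight_lb <= hi:
        raise ValueError(
            f"{weight_lb} lb is outside the observed range {lo}-{hi} lb; "
            "a prediction here would be an extrapolation"
        )
    return INTERCEPT + SLOPE * weight_lb


print(round(predict_mpg(3000), 3))   # inside the assumed range
try:
    predict_mpg(5000)                # outside the assumed range
except ValueError as err:
    print("No prediction:", err)
```

With real data, the limits would be replaced by the smallest and largest weights actually observed.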

 

 

Regression Diagnostics (i.e., how well does our regression line predict and explain?)

There are many regression diagnostics, but we’ll keep things simple and only discuss two:

 

The Square of the Correlation Coefficient, $r^2$

·         In regression analysis, the numerical value of $r^2$ tells us the proportion of the variation in the response variable that is explained by the regression line. That is, there is natural variation in the response variable, but $r^2$ tells us the proportion of that variation that is explained by the linear relationship with the explanatory variable. (In class we will briefly discuss the mathematics behind this.)

 

·         Recall the correlation coefficient, r, is a summary statistic that can accompany a scatterplot (during a first look at the data). When doing regression analysis, though, $r^2$ is a more informative statistic.

 

·         The closer $r^2$ is to 1 (or 100%), the better the fit of the model and the stronger the predictive power of the regression line (i.e., the more we trust our predictions). Note there is no “gold standard” for a value of $r^2$, but simply put: the higher, the better (different disciplines might have different expectations).

 

d.      (Car Example Continued) Determine the value of $r^2$ for this regression, and, in words, carefully interpret the value.

For the car data, $r^2 = (-0.811)^2 \approx 0.6577$. Hence, the regression line explains 65.77% of the variation in MPG of these cars. This is not very high. Perhaps other (appropriate) variables could be added to the model in an effort to explain more of the variation.
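To see why the square of the correlation coefficient can be read as "proportion of variation explained," it may help to compute both quantities on a small data set. The sketch below uses made-up numbers (not the car data) and our own variable names: it fits the least-squares line, measures the variation left over in the residuals (SSE) against the total variation in y (SST), and compares 1 - SSE/SST with the squared correlation.

```python
from statistics import mean

# Made-up data for illustration only (not the car data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.4]

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)   # total variation in y (SST)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = sxy / (sxx * syy) ** 0.5               # correlation coefficient
b = sxy / sxx                              # least-squares slope
a = y_bar - b * x_bar                      # least-squares intercept

# Variation left unexplained by the line: sum of squared residuals (SSE).
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
explained_fraction = 1 - sse / syy

print(f"squared correlation = {r ** 2:.6f}")
print(f"1 - SSE / SST       = {explained_fraction:.6f}")
```

The two printed values agree (up to floating-point rounding), which is the algebraic fact behind the interpretation.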

Analysis of the Residuals

·         A residual is simply the predicted y-value subtracted from the actual y-value. Using appropriate notation: residual $= y - \hat{y}$. (Be careful to do the subtraction in the appropriate way. For example, a positive residual should indicate a point that falls above the regression line.)

 

e.      (Car Example Continued) One of the cars has a weight of 2,530 pounds and gets 30 MPG. What is the residual associated with this car?

For this car, the regression line predicts $\hat{y} = 51.601 - 0.007(2530) = 33.891$ MPG. So the residual for this particular car is $30 - 33.891 = -3.891$ MPG. (Note: You might wonder about the magnitude of this number; that is, is it big? In our cursory look at regression, we won’t address issues of magnitude, but realize there are standardized residuals, and all sorts of other regression diagnostics.)

 

·         Examining the residuals (all together) helps us assess how well the regression line describes the overall relationship in the data. The most basic residual plot is a scatterplot of residuals (y-axis) versus the explanatory variable (x-axis). If the regression line captures the overall relationship in the data, there should be no pattern in the residual plot (just a random scatter/cloud of points around the 0-line). That is, if the relationship in the data is modeled well, then all that’s left over is random variation. A pattern in the residuals indicates that the straight-line model is somehow inadequate.

 

f.        (Car Example Continued) The residual plot from this regression is shown below. What does this plot tell you about the adequacy of the regression model?

 

There is a slight fanning out of the residuals (for cars in the 1500-2500-pound range); otherwise, the plot shows a “random scatter” of residuals around the 0-line. That is, a straight line seems a (mostly) appropriate model for the MPG-and-Weight data.

 

·         Another example: For a hypothetical class, the exam scores and study time (in hours) are shown in the scatterplot below. Because the relationship seems linear, a regression line is fit. The residual plot from the regression is also shown below. Does the residual plot show a random scatter of points? A pattern? What does this tell you about the regression model?

 

The residual plot definitely shows a fanning-out pattern (more variation in the residuals for smaller study times). When residuals fan out, typically either or both variables can be transformed (square-root or logarithm transformations often work well) and then the regression is re-run on the transformed variables. (Clearly, this work would need to be done with computer software.)

·         One more example: For a sample of girls, the age and average height are recorded. These variables are shown in the scatterplot below. The linear relationship is very strong, with a correlation coefficient of 0.994. A regression line is fit to the data (predicted height = 27.62 + 2.58 × age). The residual plot from the regression is also shown below. What does the residual plot tell you about the regression model?

 

In this case the residual plot shows a clear pattern of curvature. This indicates a quadratic curve (rather than a straight line) is a better model for the relationship in these data. (Note this subtlety might go unnoticed in the original scatterplot, but is quite clear after examination of the residuals.)
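The same effect can be reproduced with a tiny made-up data set (not the girls' height data). In the sketch below, y is exactly the square of x, so the true relationship is curved; a straight line still produces a high correlation, yet the residuals come out positive at both ends and negative in the middle. The variable names are our own.

```python
from statistics import mean

# Made-up curved data (y = x^2 for x = 1..8); NOT the girls' height data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [xi ** 2 for xi in x]

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx                     # least-squares slope
a = y_bar - b * x_bar             # least-squares intercept
r = sxy / (sxx * syy) ** 0.5      # correlation coefficient

# Residual = actual y minus predicted y, listed in order of increasing x.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"r = {r:.3f}")
print("residual signs in order of x:", signs)
```

A run of positive residuals, then negative, then positive again is the numerical signature of the curved pattern one would see in the residual plot.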

 

 

Other Notes on Regression

·         As is probably obvious, outliers can impact the least-squares regression line. Some outliers (particularly outliers in the x-direction) can be quite influential and might markedly change the equation of the regression line (and the $r^2$ value). In these cases, the regression analysis can be done both with and without the outlier(s), and both sets of results can be reported.

 

·         We have barely scratched the surface on regression models and diagnostics. For example, in most situations it seems natural to use more than one explanatory variable to predict the response variable (especially for complex systems). This idea (multiple regression), along with further regression diagnostics, is explored in Econometrics (and in my Applied Statistical Methods course).
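Returning to the note on outliers above: the sketch below shows how much a single point that is far out in the x-direction can move the least-squares line. The data are made up for illustration, and `fit_line` is our own helper name, not a function from any statistics package. The line is fit once without the outlier and once with it, mirroring the "report both" advice.

```python
from statistics import mean


def fit_line(x, y):
    """Return (intercept a, slope b) of the least-squares line for the data."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Made-up data lying exactly on the line y = 2x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# The same data plus one point that is an outlier in the x-direction.
x_out = x + [15]
y_out = y + [5]

a1, b1 = fit_line(x, y)
a2, b2 = fit_line(x_out, y_out)

print(f"without outlier: y-hat = {a1:.3f} + {b1:.3f}x")
print(f"with outlier:    y-hat = {a2:.3f} + {b2:.3f}x")
```

In this made-up case the single added point drags the slope from 2 down to well under 0.5, which is why such points are called influential.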

 

Last Bit on Regression Diagnostics

It’s important to realize the $r^2$ value and the residual plot tell us different things about a regression line. The $r^2$ value measures the proportion of the variation in the response variable that is explained by the regression line. If this value is low, then there is a lot of variability left unexplained, and we cannot trust our predictions. The residual plot tells us if a line is actually the best way to describe the relationship in the data (a random scatter of residuals means the regression line is a good description; a pattern in the residuals means there is a better description than a line or a transformation should be used).

 

Example A: High $r^2$ value and no pattern in the residuals

(A line is the best way to describe the relationship in the data and almost all of the variation in the response variable is explained. Yeah!)

 

Example B: High $r^2$ value, yet a pattern in the residuals

(Most of the variation in the response variable is explained, but a line is not the best way to characterize the relationship; a curve would fit the data better, and then $r^2$ would increase, too.)

 

 

 

 

Example C: No pattern in the residuals, but a low $r^2$ value

(A line is the best way to describe the relationship, but a lot of the variability in the response variable is left unexplained; perhaps another variable should be added to the regression.)