Math 445—Regression Analysis

 

Suppose we have two variables, and we’re interested in explaining and predicting one variable based on the other. For example, consider the selling price (in $100) and square footage of 115 houses sold in Albuquerque, New Mexico in 1993:

 

 

We want to both explain the relationship between square footage and selling price (e.g., how much more money, on average, do we get for 100 more square feet?) and also accurately predict the selling price for a new house, based solely on the square footage. Regression analysis can be used to do this.

 

The model that is applicable in the simplest regression structure (only one predictor variable, linear relationship) is the simple linear regression model:

 

Simple Linear Regression Model

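\(Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\), \(i = 1, \dots, n\), where the \(\varepsilon_i\) are independent \(N(0, \sigma^2)\) random variables and the \(x_i\) are fixed (not random) variables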

 

In class, we’ll discuss what this model says about the conditional expectation and variance of the Y variable, and draw an appropriate graph to describe this model:
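As a preview of that discussion, here is a short sketch of what the model implies. It assumes the usual statement of the model, \(Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\) with independent \(N(0, \sigma^2)\) errors and fixed \(x_i\); the notation is ours.

```latex
\begin{align*}
E(Y_i \mid x_i) &= \beta_0 + \beta_1 x_i + E(\varepsilon_i) = \beta_0 + \beta_1 x_i \\
\operatorname{Var}(Y_i \mid x_i) &= \operatorname{Var}(\varepsilon_i) = \sigma^2 \\
Y_i &\sim N\!\left(\beta_0 + \beta_1 x_i,\ \sigma^2\right), \quad \text{independently for } i = 1, \dots, n
\end{align*}
```

In words: at each x the responses are normally distributed around a mean that falls on the line, with the same spread at every x.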


Estimation

We can use data to find estimators, \(\hat\beta_0\) and \(\hat\beta_1\), of the parameters of the population regression line. There are different methods that can be used to find reasonable estimators. The most commonly used method is least squares: find the line that minimizes the sum of the squared errors. That is, for a particular set of (x, y) pairs of data, find \(\hat\beta_0\) and \(\hat\beta_1\) such that \(\sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2\) is minimized. This is a straightforward calculus minimization problem in two variables.

Least-Squares Estimators and “Normal” Equations

You can do the calculus minimization as an exercise (I’m confident you can all do it). This least-squares minimization gives the following estimators for the y-intercept and slope of the population regression line (note the x values are fixed, but the \(Y_i\) are random variables, hence the estimators are random variables):

\[ \hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(Y_i - \bar Y)}{\sum_{i=1}^n (x_i - \bar x)^2} \]

\[ \hat\beta_0 = \bar Y - \hat\beta_1 \bar x \]


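Also, since \(\hat\beta_0\) and \(\hat\beta_1\) are the least-squares estimators, we know that when we plug these values into the partial derivatives of \(\sum (y_i - b_0 - b_1 x_i)^2\) (taken with respect to \(b_0\) and \(b_1\)), those partial derivatives are 0. This gives us what are typically referred to as the “normal equations”:

 

·         \(\sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0\) (that is, the residuals of a least-squares regression sum to 0)

·         \(\sum_{i=1}^n x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0\)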

 

These equations can sometimes be helpful when doing calculations (and simplifications) with the regression-analysis theory.
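As a quick numerical check, here is a minimal Python sketch that computes the least-squares line with the usual closed-form solution (slope = Sxy/Sxx, intercept = ȳ minus slope times x̄) and then verifies both normal equations. The data are made up for illustration (they are not the Albuquerque data), and the function name `least_squares` is our own.

```python
def least_squares(x, y):
    """Return (b0, b1): least-squares intercept and slope for paired data."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("the x values must not all be equal")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx            # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


# Hypothetical illustration data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = least_squares(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# The two normal equations: both sums should be 0 (up to rounding error)
sum_resid = sum(residuals)
sum_x_resid = sum(xi * ei for xi, ei in zip(x, residuals))

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"sum of residuals = {sum_resid:.2e}")
print(f"sum of x * residual = {sum_x_resid:.2e}")
```

Any data set would work here; the two sums are zero by construction whenever the line is fit by least squares.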

 

Properties (Mean, Variance, and Distribution) of Least-Squares Estimator of the Slope

We will use \(\hat\beta_1\) to estimate the slope of the population regression line. Hence, we want to know properties of \(\hat\beta_1\). Is it an unbiased estimator? What is its variance? What is its distribution? We’ll work through the answers to these questions in class:
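For reference, here is one standard way to organize the argument (the class derivation may proceed differently). It assumes the simple linear regression model with independent normal errors and writes the slope estimator as a linear combination of the \(Y_i\); the weights \(c_i\) and the shorthand \(S_{xx}\) are our notation.

```latex
\begin{align*}
\hat\beta_1 &= \sum_{i=1}^n c_i Y_i, \qquad c_i = \frac{x_i - \bar x}{S_{xx}}, \qquad S_{xx} = \sum_{i=1}^n (x_i - \bar x)^2 \\
E(\hat\beta_1) &= \sum_{i=1}^n c_i (\beta_0 + \beta_1 x_i) = \beta_1
  \qquad \text{since } \sum c_i = 0 \text{ and } \sum c_i x_i = 1 \\
\operatorname{Var}(\hat\beta_1) &= \sigma^2 \sum_{i=1}^n c_i^2 = \frac{\sigma^2}{S_{xx}}
  \qquad \text{by independence of the } Y_i \\
\hat\beta_1 &\sim N\!\left(\beta_1,\ \frac{\sigma^2}{S_{xx}}\right)
  \qquad \text{a linear combination of independent normals is normal}
\end{align*}
```

So the slope estimator is unbiased, and its variance shrinks as the x values become more spread out.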

 


 

Estimation of the Common Variance

Notice that \(\sigma^2\) is involved in the variance of \(\hat\beta_1\), yet \(\sigma^2\) is unknown. We need to find an estimator of \(\sigma^2\). The textbook showed that \(\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 x_i)^2\) is the maximum-likelihood estimator of \(\sigma^2\). Is this an unbiased estimator?

 

After some straightforward (but grungy) algebra, we can show \(E(\hat\sigma^2) = \frac{n-2}{n}\sigma^2\). That is, the MLE is a biased estimator. But it’s easy to create an unbiased estimator of \(\sigma^2\):

 

\(\frac{n}{n-2}\hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 x_i)^2\) [This is typically denoted \(S^2\)]

 

Note that \(Y_i - \hat\beta_0 - \hat\beta_1 x_i\) is simply the i-th residual. We know the residuals sum to 0, so the sample mean of the residuals is 0. Then \(S^2\) is essentially the sample variance of the residuals, except that we divide by (n-2) rather than (n-1), because two regression parameters are estimated so we “lose” two degrees of freedom. Since the residuals are estimates of the errors, \(\varepsilon_i\), it makes sense that the sample variance of the residuals is our estimator of the variance of the errors, \(\sigma^2\).
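To make the difference between the two variance estimators concrete, here is a small Python sketch that computes both from the residuals of a fitted line. The data are made up for illustration, and the function name `variance_estimates` is our own.

```python
def variance_estimates(x, y):
    """Return (mle, unbiased): SSE/n and SSE/(n-2) for a least-squares line."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Sum of squared residuals (SSE)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return sse / n, sse / (n - 2)


# Hypothetical illustration data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

mle, s2 = variance_estimates(x, y)
print(f"MLE (divide by n):        {mle:.3f}")
print(f"Unbiased (divide by n-2): {s2:.3f}")
```

With only five observations the two estimates differ noticeably; for large n the factor n/(n-2) is close to 1 and the distinction matters little.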

 

Inference about the Slope of the Population Regression Line

Typically we want to make inference (significance test or confidence interval) on the slope of the population regression line. We can use our estimator, \(\hat\beta_1\), to do this inference (note: we need the conditions of normality and constant variance in order for the theory to work out). We’ll work through these derivations in class:
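For reference, the standard results that these derivations lead to are summarized below. This is a sketch under the model conditions (independent normal errors with constant variance); \(S^2\) denotes the unbiased variance estimator with n-2 in the denominator and \(S_{xx} = \sum (x_i - \bar x)^2\), both our shorthand.

```latex
\begin{align*}
T &= \frac{\hat\beta_1 - \beta_1}{S / \sqrt{S_{xx}}} \sim t_{n-2} \\
\text{Test of } H_0\colon \beta_1 = 0: \quad t_{\text{obs}} &= \frac{\hat\beta_1}{S / \sqrt{S_{xx}}} \\
100(1-\alpha)\% \text{ confidence interval}: \quad & \hat\beta_1 \pm t_{\alpha/2,\, n-2}\, \frac{S}{\sqrt{S_{xx}}}
\end{align*}
```

The quantity \(S/\sqrt{S_{xx}}\) is the standard error of the slope, which regression software typically reports next to the coefficient.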


 

Confidence Interval for a Mean Value

Suppose we want to estimate E(Y) for a fixed value of x, and we want to do so with a confidence interval. For our simple linear regression model, \(E(Y) = \beta_0 + \beta_1 x\). Then the predicted value, \(\hat Y = \hat\beta_0 + \hat\beta_1 x\), is an estimator of \(E(Y)\), and it’s an unbiased estimator (assuming the true model is a straight line).

 

What is the variance of this estimator? And what is its distribution? We’ll work through these details in class (note: we need the conditions of normality and constant variance, in order for the theory to work out):
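For reference, the standard answers are sketched below (the class derivation may take a different route). Here \(\hat Y = \hat\beta_0 + \hat\beta_1 x\) is the fitted value at the fixed x of interest, \(S^2\) is the unbiased variance estimator, and \(S_{xx} = \sum (x_i - \bar x)^2\); this shorthand is ours.

```latex
\begin{align*}
\operatorname{Var}(\hat Y) &= \sigma^2 \left[ \frac{1}{n} + \frac{(x - \bar x)^2}{S_{xx}} \right] \\
\hat Y &\sim N\!\left( \beta_0 + \beta_1 x,\ \sigma^2 \left[ \frac{1}{n} + \frac{(x - \bar x)^2}{S_{xx}} \right] \right) \\
100(1-\alpha)\% \text{ CI for } E(Y): \quad & \hat Y \pm t_{\alpha/2,\, n-2}\, S \sqrt{ \frac{1}{n} + \frac{(x - \bar x)^2}{S_{xx}} }
\end{align*}
```

The interval is narrowest at \(x = \bar x\) and widens as x moves away from the center of the data.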



 

Prediction Interval for a New Value

Now suppose instead of estimating the mean value of Y, we want to predict a new value of Y at a fixed value of x (this will be more difficult than estimating a mean value—now we must include both the variability due to the fact that the least-squares regression line is not exactly equal to the true regression line and the variability of the future response variable, Y, around the sub-population mean). We’ll work through these details in class (note: we need the conditions of normality and constant variance, in order for the theory to work out):
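For reference, the standard result is sketched below. It treats the new response \(Y_{\text{new}}\) as independent of the data used to fit the line; \(\hat Y\), \(S\), and \(S_{xx} = \sum (x_i - \bar x)^2\) are our shorthand for the fitted value at x, the square root of the unbiased variance estimator, and the spread of the x values.

```latex
\begin{align*}
\operatorname{Var}(Y_{\text{new}} - \hat Y) &= \underbrace{\sigma^2}_{\text{new response}} + \underbrace{\sigma^2 \left[ \frac{1}{n} + \frac{(x - \bar x)^2}{S_{xx}} \right]}_{\text{estimated line}} \\
100(1-\alpha)\% \text{ prediction interval}: \quad & \hat Y \pm t_{\alpha/2,\, n-2}\, S \sqrt{ 1 + \frac{1}{n} + \frac{(x - \bar x)^2}{S_{xx}} }
\end{align*}
```

The extra "1" under the square root is the variability of an individual response around its sub-population mean, which is why a prediction interval is always wider than the confidence interval for the mean at the same x.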

 

 


 

Regression Diagnostics

We can use the residuals to check our regression model:

·         If we specified the model (i.e., a line) correctly and the conditions of the model are correct, then the basic residual plot (residuals versus fitted values) should simply show a random scatter of points. That is, we modeled the data as well as possible and all that’s left over is random variation. An example of a “random scatter” of points is shown below. (Note: Curvature in the residual plot would indicate that a nonlinear model, for example a quadratic, should be fit to the data instead of a line.)

 

·         The basic residual plot can also show a violation of the constant-variance condition. For example, the plot below shows the variability of the residuals increases as x decreases. If the constant-variance condition is violated, a common remedy is to transform one or both of the variables (e.g., log and square-root transformations typically work well) and then re-run the regression.

 

·         We can also use the residuals to see if the normality condition is met: create a histogram and a normal probability plot of the residuals. If there is a big deviation from normality, then a transformation might be helpful, but first remediate based on the previous two bullet points (linearity and constant variance) before working with the normality condition (the other changes will impact normality).

 

If the normality or constant-variance conditions are seriously violated, then we cannot trust any inference of our model (e.g., significance test on the slope, confidence interval for a mean, prediction interval for a new value).
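To see what curvature in a residual plot looks like numerically, here is a small Python sketch that deliberately fits a straight line to data generated from a quadratic. The data and the function name `fit_and_residuals` are made up for illustration; plotting the returned pairs with any graphing tool gives the residual plot.

```python
def fit_and_residuals(x, y):
    """Fit a least-squares line; return (fitted values, residuals)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals


# Hypothetical data that follow y = x^2 exactly, so a line is the wrong model
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [xi ** 2 for xi in x]

fitted, residuals = fit_and_residuals(x, y)
for f, e in zip(fitted, residuals):
    print(f"fitted = {f:6.1f}   residual = {e:5.1f}")
```

The residuals are positive at both ends and negative in the middle, a U-shaped pattern rather than a random scatter, which is the signature of a misspecified straight-line model.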

 

 

Another diagnostic for regression is the coefficient of determination, \(R^2\) (in simple linear regression, this is numerically the same as the squared correlation between x and y). It can be shown (in fact, you’ll show this on your homework):

\[ \sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (\hat y_i - \bar y)^2 + \sum_{i=1}^n (y_i - \hat y_i)^2 \]

In shorthand, SST = SSR + SSE. Conceptually, the left-hand side is the total variability in the response variable. The right-hand side is the variability of the fitted values (that is, the variability “explained” by the regression line) plus the variability of the residuals (that is, the variability “unexplained” by the line).

 

By definition, \(R^2 = \mathrm{SSR}/\mathrm{SST} = 1 - \mathrm{SSE}/\mathrm{SST}\). That is, \(R^2\) is the proportion of the variability in the response variable that is explained by the regression line. Other things being equal, the higher the \(R^2\), the more of the variability the line accounts for and the more we trust our predictions (in multiple regression, it is better to consider adjusted \(R^2\), which takes into consideration the number of predictors).
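The decomposition can be checked numerically. The following Python sketch uses made-up data (not the Albuquerque data) to compute SST, SSR, and SSE for a least-squares line and confirms that R-squared equals the squared correlation; all variable names are our own.

```python
# Hypothetical illustration data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares fit
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

sst = syy                                               # total variability
ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # explained by the line
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # left in the residuals

r_squared = ssr / sst
corr_squared = sxy ** 2 / (sxx * syy)

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"R-squared = {r_squared:.3f}, correlation squared = {corr_squared:.3f}")
```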


 

Example

Data were collected on 115 homes sold in Albuquerque, New Mexico in 1993. As mentioned previously, we’re particularly interested in the relationship between square footage and selling price—is square footage an important predictor of selling price? The data are shown in the scatterplot below.

 

 

Because a linear relationship between these variables seems reasonable, we can proceed with the regression analysis. The regression output from Minitab is shown below.

 

Regression Analysis

The regression equation is
selling price (in $100) = - 61.8 + 0.682 square feet of living space

Predictor                       Coef  SE Coef      T      P
Constant                      -61.81    53.22  -1.16  0.248
square feet of living space  0.68246  0.03126  21.83  0.000

S = 162.902   R-Sq = 80.8%   R-Sq(adj) = 80.7%

Analysis of Variance

Source           DF        SS        MS       F      P
Regression        1  12646380  12646380  476.56  0.000
Residual Error  113   2998686     26537
Total           114  15645066

Predicted Values for New Observations

New Obs      Fit   SE Fit           95% CI            95% PI
2000      1303.1     19.1  (1265.3, 1340.9)  (978.2, 1628.0)

 

The value of R-squared indicates that 80.8% of the variation in selling price is explained by its linear relationship with square footage. This is a fairly high R-squared value. Hence, we can feel pretty good about making predictions based on this model.
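The summary statistics in the output can be recovered from the Analysis of Variance table alone. The Python sketch below does this using the sums of squares and degrees of freedom copied from the Minitab output above; the variable names are ours, and the adjusted R-squared formula used is the usual 1 - (SSE/df error)/(SST/df total).

```python
import math

# Values copied from the Analysis of Variance table in the Minitab output
ss_regression = 12646380
ss_error = 2998686
ss_total = 15645066
df_error = 113
df_total = 114

mse = ss_error / df_error                    # mean squared error (S^2)
s = math.sqrt(mse)                           # "S" in the output
r_squared = ss_regression / ss_total
r_squared_adj = 1 - mse / (ss_total / df_total)
f_stat = (ss_regression / 1) / mse           # regression has 1 df

print(f"S = {s:.3f}")
print(f"R-Sq = {100 * r_squared:.1f}%   R-Sq(adj) = {100 * r_squared_adj:.1f}%")
print(f"F = {f_stat:.2f}")
```

In simple linear regression the F statistic is the square of the t statistic for the slope, which is why 21.83 squared is approximately 476.56.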

 

Before we do any inference based on the model, we must check the normality and constant-variance conditions by looking at appropriate graphs of the residuals:

 

  

 

The normality condition clearly appears to be met. What about the constant-variance condition? Note the variability in the residuals seems to “fan out” a bit (more variability in the residuals for higher-priced houses). This violation isn’t awful, but perhaps a transformation of the data should at least be considered (we’ll consider this in lab).

 

Now consider the slope of the regression line. A significance test on the true slope (\(H_0: \beta_1 = 0\) versus \(H_a: \beta_1 \neq 0\)) shows clear evidence (Minitab reports the p-value as 0.000, that is, less than 0.0005) that the population slope is different from 0. Assuming the population slope is 0 (that is, that square footage has no linear impact on the selling price of a house), there is essentially no chance of getting our sample slope value or a more extreme slope value (note: that is just the “definition of the p-value in the context of the problem” as we’ve discussed many times). This gives us strong evidence that the square footage of a house has a statistically significant linear impact on the selling price of a house (our p-value is smaller than any typically used significance level).

 

But is this result practically significant? To answer this question, we can create a 95% confidence interval for the population slope: \(0.68246 \pm 1.981(0.03126) = (0.6205, 0.7444)\). (I got the t-value for 113 df, 1.981, from Minitab; you can get an estimate of it, based on 120 df, from Table A.5: 1.98.) Because selling price is recorded in units of $100, we are 95% confident that each additional 100 square feet of living area increases the average selling price by between about $6,205 and $7,444 (remember, interpretation of the slope in the context of the problem is part of the “explanation” piece of regression). As always, our confidence is in the method we use (i.e., our method gives correct results 95% of the time). Do you think this is of practical importance?
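The arithmetic behind the test statistic and the interval can be reproduced directly from the coefficient table. The Python sketch below uses the slope, its standard error, and the t critical value quoted in these notes; the variable names are ours, and the conversion to dollars assumes price is in units of $100 as stated in the output.

```python
# Values copied from the Minitab coefficient table and the notes
slope = 0.68246      # estimated slope, in $100 per square foot
se_slope = 0.03126   # standard error of the slope
t_crit = 1.981       # t critical value, 95% confidence, 113 df (from the notes)

# Test statistic for H0: slope = 0
t_stat = slope / se_slope

# 95% confidence interval for the population slope
lower = slope - t_crit * se_slope
upper = slope + t_crit * se_slope

# Dollars per additional 100 square feet:
# (units of $100 per sq ft) * (100 sq ft) * ($100 per unit)
dollars_low = lower * 100 * 100
dollars_high = upper * 100 * 100

print(f"t = {t_stat:.2f}")
print(f"95% CI for slope: ({lower:.4f}, {upper:.4f})")
print(f"Per 100 sq ft: ${dollars_low:,.0f} to ${dollars_high:,.0f}")
```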

 

The last bit of output shows both a confidence interval for the average selling price and a prediction interval for the selling price of a new house with 2000 square feet (Minitab can easily create these intervals for any x-value of interest). Note the prediction interval is substantially wider (which isn’t surprising). Be sure to use the interval (confidence interval for a mean response or prediction interval for a new value) that best answers your particular research question.
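Both intervals can be approximately reconstructed from numbers already in the output. The Python sketch below uses the rounded coefficients, S, and SE Fit from the Minitab output, so its results agree with Minitab only up to rounding. It relies on the fact that the standard error for predicting a new value combines the error variance with the variance of the fitted value: sqrt(S² + (SE Fit)²). The variable names are ours.

```python
import math

# Values copied (rounded) from the Minitab output
intercept = -61.81
slope = 0.68246
s = 162.902          # estimate of the error standard deviation
se_fit = 19.1        # standard error of the fitted value at x = 2000
t_crit = 1.981       # t critical value, 95% confidence, 113 df (from the notes)

x_new = 2000
fit = intercept + slope * x_new

# Confidence interval for the mean selling price at x_new
ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)

# Prediction interval for one new house at x_new
se_pred = math.sqrt(s ** 2 + se_fit ** 2)
pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)

print(f"Fit = {fit:.1f}")
print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI: ({pi[0]:.1f}, {pi[1]:.1f})")
```

Because S (about 163) dominates SE Fit (about 19), the prediction interval is driven almost entirely by house-to-house variability, not by uncertainty in the estimated line.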