Math 217—Simple-Linear Regression Example

 

Data were collected on 115 homes sold in Albuquerque, New Mexico in 1993. In particular, we’re interested in the relationship between square footage and selling price: is square footage an important predictor of selling price? The data are shown in the scatterplot below.

 

 

Because a linear relationship between these variables seems reasonable, we can proceed with the regression analysis. The regression output from Minitab is shown below.

 

Regression Analysis

The regression equation is

selling price (in $100) = - 61.8 + 0.682 square feet of living space

 

Predictor                       Coef  SE Coef      T      P

Constant                      -61.81    53.22  -1.16  0.248

square feet of living space  0.68246  0.03126  21.83  0.000

 

S = 162.902   R-Sq = 80.8%   R-Sq(adj) = 80.7%

 

Analysis of Variance

Source           DF        SS        MS       F      P

Regression        1  12646380  12646380  476.56  0.000

Residual Error  113   2998686     26537

Total           114  15645066

 

Predicted Values for New Observations

New Obs      Fit   SE Fit           95% CI            95% PI

2000      1303.1     19.1  (1265.3, 1340.9)  (978.2, 1628.0)

 

The value of R-squared indicates that 80.8% of the variation in selling price is explained by its linear relationship with square footage. This is a fairly high R-squared value. Hence, we can feel pretty good about making predictions based on this model.
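To see where these summary numbers come from, here is a minimal Python sketch that recomputes R-Sq, R-Sq(adj), S, and F from the sums of squares in the Analysis of Variance table. The inputs are copied from the Minitab output; the variable names are my own and are not part of that output.

```python
import math

# Values copied from the Analysis of Variance table in the Minitab output.
ss_regression = 12646380
ss_error = 2998686
ss_total = 15645066
df_regression = 1
df_error = 113
df_total = df_regression + df_error  # 114

# R-squared: share of the total variation explained by the regression.
r_squared = ss_regression / ss_total

# Adjusted R-squared compares mean squares instead of sums of squares.
ms_error = ss_error / df_error
ms_total = ss_total / df_total
r_squared_adj = 1 - ms_error / ms_total

# S is the square root of the residual mean square.
s = math.sqrt(ms_error)

# F is the regression mean square divided by the residual mean square.
f_stat = (ss_regression / df_regression) / ms_error

print(f"R-Sq      = {100 * r_squared:.1f}%")
print(f"R-Sq(adj) = {100 * r_squared_adj:.1f}%")
print(f"S         = {s:.3f}")
print(f"F         = {f_stat:.2f}")
```

The printed values should match the corresponding entries in the Minitab output, up to rounding.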

 

Before we do any inference based on the model, we must check the normality and constant-variance conditions by looking at appropriate graphs of the residuals:

 

  

 

The normality condition clearly appears to be met. What about the constant-variance condition? Note the variability in the residuals seems to “fan out” a bit (more variability in the residuals for higher-priced houses). This violation isn’t awful, but perhaps a transformation of the data should at least be considered (we’ll consider this in lab).
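As a rough numeric companion to the residual plot, the sketch below fits a least-squares line and compares the spread of the residuals for the lower and upper halves of the fitted values. This is only an informal check, not a formal test, and the data in it are made up for illustration (they are NOT the Albuquerque data); the function names are hypothetical as well.

```python
import statistics

def fit_line(x, y):
    """Return the least-squares (intercept, slope) for simple linear regression."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residual_spread_by_half(x, y):
    """Standard deviation of the residuals for the lower and upper half of the fits."""
    intercept, slope = fit_line(x, y)
    # Pair each fitted value with its residual, then sort by fitted value.
    pairs = sorted(
        (intercept + slope * xi, yi - (intercept + slope * xi))
        for xi, yi in zip(x, y)
    )
    middle = len(pairs) // 2
    low = [residual for _, residual in pairs[:middle]]
    high = [residual for _, residual in pairs[middle:]]
    return statistics.pstdev(low), statistics.pstdev(high)

# Made-up illustrative data: square feet and selling price (in $100).
sq_ft = [1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800]
price = [625, 751, 902, 1018, 1184, 1260, 1496, 1492, 1808, 1724]

low_sd, high_sd = residual_spread_by_half(sq_ft, price)
print(f"residual SD, lower half of fits: {low_sd:.1f}")
print(f"residual SD, upper half of fits: {high_sd:.1f}")
```

A much larger spread in the upper half is the numeric version of the “fan” shape described above. With real data you would still want to look at the plot itself.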

 

Now consider the slope of the regression line. A significance test on the true slope ($H_0\colon \beta_1 = 0$ versus $H_a\colon \beta_1 \neq 0$) shows clear evidence (p-value reported as 0.000, that is, less than 0.0005) that the population slope is different from 0. Assuming the population slope is 0 (that is, that square footage has no linear impact on the selling price of a house), there is essentially no chance of getting our sample slope value or a more extreme slope value (note: that was just the “definition of the p-value in the context of the problem” as I mentioned in class). This gives us strong evidence that the square footage of a house has a statistically significant linear impact on the selling price of a house (our p-value is smaller than any typically used significance level).
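The test statistic behind this conclusion is just the sample slope divided by its standard error. Here is a minimal Python sketch of that arithmetic, using the coefficient row from the Minitab output and the two-sided 5% critical value for 113 df (1.981) that this handout quotes from Minitab. The variable names are my own.

```python
# Values copied from the "square feet of living space" row of the output.
coef = 0.68246
se_coef = 0.03126
df_error = 113

# Test statistic for the null hypothesis that the population slope is 0.
t_stat = (coef - 0) / se_coef

# Two-sided 5% critical value for 113 df, as quoted from Minitab in this handout.
t_critical = 1.981

# Reject the null hypothesis when the statistic falls beyond the critical value.
reject_h0 = abs(t_stat) > t_critical

print(f"t = {t_stat:.2f}, critical value = {t_critical}, reject H0: {reject_h0}")

# In simple linear regression, t squared equals the ANOVA F statistic
# (up to rounding in the printed coefficient and standard error).
print(f"t squared = {t_stat ** 2:.1f}")
```

The statistic is more than ten times the critical value, which is why the p-value prints as 0.000.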

 

But is this result practically significant? To answer this question, we can create a 95% confidence interval for the population slope: (0.621, 0.744). (I got the t-value, 1.981, from Minitab; you can get an estimate of it, based on 100 df, from Table D: 1.984.) Because selling price is recorded in units of $100, this means that for each additional 100 square feet of living area, we are 95% confident that the mean selling price increases by between $6,210 and $7,440 (remember, interpretation of the slope in the context of the problem is part of the “explanation” piece of regression). As always, our confidence is in the method we use (i.e., our method gives correct results 95% of the time). Do you think this is of practical importance?
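The interval is the sample slope plus or minus the t-value times the standard error of the slope. A minimal Python sketch of that calculation follows, including the conversion to dollars per additional 100 square feet. The inputs come from the Minitab output and the t-value quoted above; the variable names are my own.

```python
# Values copied from the Minitab output and the t-value quoted in the text.
coef = 0.68246        # slope, in units of $100 per square foot
se_coef = 0.03126     # standard error of the slope
t_star = 1.981        # 95% two-sided t-value for 113 df

# Confidence interval: estimate plus or minus t* times the standard error.
margin = t_star * se_coef
lower = coef - margin
upper = coef + margin

# Convert to dollars per additional 100 square feet:
# multiply by 100 (price is in $100) and by 100 (per 100 square feet).
lower_dollars = lower * 100 * 100
upper_dollars = upper * 100 * 100

print(f"95% CI for slope: ({lower:.4f}, {upper:.4f})")
print(f"Per 100 sq ft: ${lower_dollars:,.0f} to ${upper_dollars:,.0f}")
```

The result should land close to the interval quoted above; small differences come from rounding the coefficient, standard error, and t-value.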

 

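The last bit of output shows both a confidence interval for the average selling price of houses with 2000 square feet and a prediction interval for the selling price of a single new house with 2000 square feet (Minitab can easily create these intervals for any x-value of interest). Remember that these values are in units of $100, so the fit of 1303.1 is about $130,310. Note the prediction interval is substantially wider, which isn’t surprising: it has to account for the variability of individual houses around the line, not just the uncertainty in the estimated mean. Be sure to use the interval (confidence interval for a mean response or prediction interval for a new value) that best answers your particular research question.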
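To see why the prediction interval is wider, here is a minimal Python sketch that rebuilds both intervals from numbers already in the Minitab output: the fitted coefficients, SE Fit, S, and the t-value 1.981 quoted earlier. It uses the standard simple-regression result that the standard error for predicting a single new value combines S and SE Fit. The variable names are my own.

```python
import math

# Values copied from the Minitab output.
intercept = -61.81
slope = 0.68246
s = 162.902        # residual standard deviation (S)
se_fit = 19.1      # standard error of the fitted mean at x = 2000
t_star = 1.981     # 95% two-sided t-value for 113 df
x_new = 2000       # square feet of living space

# Fitted value (in units of $100).
fit = intercept + slope * x_new

# Confidence interval for the mean response uses SE Fit alone.
ci = (fit - t_star * se_fit, fit + t_star * se_fit)

# Prediction interval adds the variability of a single house around the line.
se_pred = math.sqrt(s ** 2 + se_fit ** 2)
pi = (fit - t_star * se_pred, fit + t_star * se_pred)

print(f"Fit    = {fit:.1f}")
print(f"95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI = ({pi[0]:.1f}, {pi[1]:.1f})")
```

Because S is so much larger than SE Fit here, almost all of the width of the prediction interval comes from house-to-house variability rather than from uncertainty about the line.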