Math 217—Multiple Regression Example

 

Data were collected on 522 homes sold in a midwestern city. The variables measured were sales price (in dollars), finished square feet, number of bedrooms, number of bathrooms, air conditioning status, garage size (number of cars it will hold), pool status, year built, an index for quality of construction (1, 2, or 3, with 1 being the highest quality), lot size (in square feet), and adjacent-highway status.

 

In class, I’ll show you a number of graphs of the individual variables and of each variable plotted against sales price. (Remember, it’s good practice to start your analysis at a basic graphical and numerical level, before jumping into multiple regression.)

 

Using both Minitab’s stepwise and best-subsets procedures, the following predictors seem most important: finished square feet, number of bedrooms, garage size, quality index, and lot size. The regression output for this particular model is shown below. How do you interpret all these results? (Remember, this is the potentially “dangerous” territory of using our data to both create and check our model. If we use this model as an explanation of sales price, we’ll want to check it on a new data set. Alternatively, I could have separated my data into a model-selection set and a model-testing set.)
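The alternative mentioned above, separating the data into a model-selection set and a model-testing set, could be sketched roughly as follows. This is only an illustration: the 50/50 share, the seed, and the function name split_rows are assumptions, not something taken from the Minitab analysis.

```python
import random

N_HOMES = 522          # number of homes in the data set (from the handout)
SELECTION_SHARE = 0.5  # assumption: an even split; the handout names no ratio


def split_rows(n_rows, selection_share, seed=217):
    """Randomly split row indices into model-selection and model-testing sets."""
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    cut = round(n_rows * selection_share)
    return sorted(rows[:cut]), sorted(rows[cut:])


selection_rows, testing_rows = split_rows(N_HOMES, SELECTION_SHARE)
print(len(selection_rows), len(testing_rows))
```

Stepwise and best-subsets would then be run on the selection rows only, and the chosen model would be refit and checked on the testing rows.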

 

 

The regression equation is

Sales Price (in dollars) = 142644 + 108 Finished Square Feet - 9129 Number of Bedrooms

                           + 23130 Garage Size (no. of cars) - 69804 Quality Index

                           + 1.07 Lot Size (in square feet)

 

 

Predictor                     Coef  SE Coef       T      P

Constant                    142644    28970    4.92  0.000

Finished Square Feet       108.304    6.670   16.24  0.000

Number of Bedrooms           -9129     3543   -2.58  0.010

Garage Size (no. of cars)    23130     5647    4.10  0.000

Quality Index               -69804     6753  -10.34  0.000

Lot Size (in square feet)   1.0664   0.2592    4.11  0.000
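One way to start interpreting the coefficients is to plug a home into the fitted equation. The sketch below uses the coefficients from the table above; the example home (2000 finished square feet, 3 bedrooms, 2-car garage, quality index 2, 20000 square foot lot) is hypothetical and chosen only for illustration.

```python
# Coefficients copied from the Minitab coefficient table above.
COEF = {
    "constant": 142644,
    "sqft": 108.304,
    "bedrooms": -9129,
    "garage": 23130,
    "quality": -69804,
    "lot": 1.0664,
}


def predict_price(sqft, bedrooms, garage, quality, lot):
    """Predicted sales price (dollars) from the fitted regression equation."""
    return (COEF["constant"]
            + COEF["sqft"] * sqft
            + COEF["bedrooms"] * bedrooms
            + COEF["garage"] * garage
            + COEF["quality"] * quality
            + COEF["lot"] * lot)


# Hypothetical home, not a row from the data set.
base = predict_price(sqft=2000, bedrooms=3, garage=2, quality=2, lot=20000)

# Same home with one more garage space, everything else held fixed.
bigger_garage = predict_price(sqft=2000, bedrooms=3, garage=3, quality=2, lot=20000)

print(round(base), round(bigger_garage - base))
```

The difference between the two predictions equals the garage coefficient, which is the usual reading of a slope in multiple regression: the predicted change in sales price for a one-unit change in that predictor, with the other predictors held fixed.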

 

 

S = 67963.2   R-Sq = 76.0%   R-Sq(adj) = 75.7%

 

 

Analysis of Variance

Source           DF           SS           MS       F      P

Regression        5  7.52751E+12  1.50550E+12  325.94  0.000

Residual Error  516  2.38340E+12   4618994024

Total           521  9.91091E+12
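The summary line (S, R-Sq, R-Sq(adj)) and the F statistic can all be recovered from the ANOVA table. The short check below is a sketch using only the numbers printed above; the variable names are mine.

```python
import math

# Degrees of freedom and sums of squares from the ANOVA table above.
df_reg, df_err, df_tot = 5, 516, 521
ss_reg = 7.52751e12
ss_err = 2.38340e12
ss_tot = 9.91091e12

ms_reg = ss_reg / df_reg              # mean square for regression
ms_err = ss_err / df_err              # mean square error

r_sq = ss_reg / ss_tot                # share of variation explained
r_sq_adj = 1 - ms_err / (ss_tot / df_tot)   # adjusted for number of predictors
s = math.sqrt(ms_err)                 # typical size of a residual, in dollars
f_stat = ms_reg / ms_err              # overall F statistic

print(f"R-Sq = {100 * r_sq:.1f}%, R-Sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"S = {s:.1f}, F = {f_stat:.2f}")
```

These values should agree with the Minitab output up to rounding of the printed sums of squares.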

 

 

Below are several plots of the residuals. What do these indicate about whether or not our model conditions are met?

 

 

 

After applying a natural log (base e) transformation to Sales Price, the stepwise and best-subsets procedures choose a slightly different model (one that includes number of bathrooms rather than number of bedrooms). The regression results and residual plots for this new model are shown on the back of this page. What do you think of the new model?


The regression equation is

Log Sales Price = 11.9 + 0.000279 Finished Square Feet + 0.0444 Number of Bathrooms

                  + 0.0694 Garage Size (no. of cars) - 0.218 Quality Index

                  + 0.000004 Lot Size (in square feet)

 

 

Predictor                        Coef     SE Coef       T      P

Constant                      11.9254      0.0824  144.79  0.000

Finished Square Feet       0.00027926  0.00001958   14.26  0.000

Number of Bathrooms           0.04444     0.01266    3.51  0.000

Garage Size (no. of cars)     0.06940     0.01576    4.40  0.000

Quality Index                -0.21766     0.01976  -11.02  0.000

Lot Size (in square feet)  0.00000370  0.00000072    5.12  0.000

 

 

S = 0.189423   R-Sq = 80.9%   R-Sq(adj) = 80.7%

 

 

Analysis of Variance

Source           DF      SS      MS       F      P

Regression        5  78.568  15.714  437.94  0.000

Residual Error  516  18.515   0.036

Total           521  97.083
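Because the response is now the natural log of sales price, each coefficient describes a multiplicative (percentage) change in predicted price rather than a dollar change. As a rough sketch, exponentiating a coefficient gives the factor by which the predicted price changes for a one-unit increase in that predictor, other predictors held fixed. The helper name pct_change and the "per 100 square feet" and "per 1000 square feet" scalings are my own choices for readability.

```python
import math

# Coefficients copied from the log-model coefficient table above.
LOG_COEF = {
    "sqft": 0.00027926,
    "bathrooms": 0.04444,
    "garage": 0.06940,
    "quality": -0.21766,
    "lot": 0.00000370,
}


def pct_change(coef, units=1):
    """Percent change in predicted price for a `units` increase in a predictor."""
    return 100 * (math.exp(coef * units) - 1)


garage_pct = pct_change(LOG_COEF["garage"])        # one more garage space
quality_pct = pct_change(LOG_COEF["quality"])      # index up by 1 (lower quality)
sqft_pct = pct_change(LOG_COEF["sqft"], 100)       # 100 more finished sq ft
lot_pct = pct_change(LOG_COEF["lot"], 1000)        # 1000 more sq ft of lot

for name, value in [("garage", garage_pct), ("quality", quality_pct),
                    ("sqft per 100", sqft_pct), ("lot per 1000", lot_pct)]:
    print(f"{name}: {value:+.2f}%")
```

Note that the two models cannot be compared directly on S, since S is in dollars for the first model and in log-dollars for the second.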

 

 

 

General Note on Indicator (or “Dummy”) Variables

None of the yes/no categorical variables (air conditioning, pool, and adjacent-highway status) was deemed (by our computer procedures) to have a significant impact on sales price, in the presence of the other variables. Still, I want you to understand the idea of indicator variables and how to interpret the coefficients on indicator variables. Included below is the regression output from a simple regression of sales price on pool status. How do you interpret the slope coefficient? How would you interpret the coefficients if there were multiple indicator variables?

 

The regression equation is

Sales Price (in dollars) = 272396 + 79724 Pool?

 

 

Predictor    Coef  SE Coef      T      P

Constant   272396     6195  43.97  0.000

Pool?       79724    23589   3.38  0.001

 

 

S = 136564   R-Sq = 2.1%   R-Sq(adj) = 2.0%
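A small sketch may help with the interpretation. It assumes the indicator Pool? is coded 1 for a home with a pool and 0 for a home without one; the output does not state the coding, so treat that as an assumption. Under that coding, the constant is the predicted (mean) price for homes without a pool, and the slope is the difference in mean price between the two groups. The interval at the end is only a rough large-sample approximation using a multiplier of 1.96.

```python
# Coefficients and standard error copied from the Minitab output above.
CONSTANT = 272396
POOL_COEF = 79724
POOL_SE = 23589


def predict_price(has_pool):
    """Predicted sales price; has_pool is assumed coded 1 (pool) or 0 (no pool)."""
    return CONSTANT + POOL_COEF * has_pool


mean_no_pool = predict_price(0)   # the constant: baseline group
mean_pool = predict_price(1)      # constant + slope: pool group

t_ratio = POOL_COEF / POOL_SE     # matches the T column, up to rounding

# Rough 95% interval for the difference in mean price (approximation).
low = POOL_COEF - 1.96 * POOL_SE
high = POOL_COEF + 1.96 * POOL_SE

print(mean_no_pool, mean_pool, round(t_ratio, 2))
print(round(low), round(high))
```

With several indicator variables in one model, each coefficient would be read the same way: the predicted difference between the "1" group and the "0" group for that variable, with the other predictors held fixed. The small R-Sq here is a reminder that a significant difference in means does not imply that pool status explains much of the variation in price.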