Math 445—Regression Analysis
Suppose we have two variables, and we’re interested in explaining and predicting one variable based on the other. For example, consider the selling price (in $100) and square footage of 115 houses sold in Albuquerque, New Mexico in 1993:

We want to both explain the relationship between square footage and selling price (e.g., how much more money, on average, do we get for 100 more square feet?) and also accurately predict the selling price for a new house, based solely on the square footage. Regression analysis can be used to do this.
The model that is applicable in the simplest regression structure (only one predictor variable, linear relationship) is the simple linear regression model:
Simple Linear Regression Model
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the $\varepsilon_i$ are independent $N(0, \sigma^2)$ random variables and the $x_i$ are fixed (not random) variables
In class, we’ll discuss what this model says about the conditional expectation and variance of the Y variable, and draw an appropriate graph to describe this model:
Estimation
We can use data to find estimators, $\hat{\beta}_0$ and $\hat{\beta}_1$, of the parameters for the population regression line. There are different methods that can be used to find reasonable estimators. The most commonly used method is least squares: find the line that minimizes the sum of the squared errors. That is, for a particular set of (x, y) pairs of data, find $\hat{\beta}_0$ and $\hat{\beta}_1$ such that $\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$ is minimized. This is a straightforward calculus minimization problem in two variables.
Least-Squares Estimators and “Normal” Equations
You can do the calculus minimization as an exercise (I’m confident you can all do it). This least-squares minimization gives the following estimators for the y-intercept and slope of the population regression line (note the x values are fixed, but the $Y_i$ are random variables; hence the estimators are random variables):
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
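As a minimal sketch of the least-squares calculation (not part of the original notes), the slope and intercept can be computed directly from paired data. The function name `least_squares` and the five data points are made up for illustration; they are not the housing data.

```python
def least_squares(x, y):
    """Return (b0, b1), the least-squares intercept and slope for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # s_xy = sum of (x_i - x_bar)(y_i - y_bar); s_xx = sum of (x_i - x_bar)^2
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope estimate
    b0 = y_bar - b1 * x_bar    # intercept estimate
    return b0, b1


# Made-up illustrative data (NOT the Albuquerque housing data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(x, y)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")  # prints: intercept = 0.05, slope = 1.99
```

Because the intercept is computed from the slope, the fitted line always passes through the point of means $(\bar{x}, \bar{y})$.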
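Also, since $\hat{\beta}_0$ and $\hat{\beta}_1$ are the least-squares estimators, we know that when we plug these values into the partial derivatives of $\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2$, those partial derivatives are 0. This gives us what are typically referred to as the “normal equations”:
· $\sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$ (that is, the residuals of a least-squares regression sum to 0)
· $\sum_{i=1}^{n} x_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$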
These equations can sometimes be helpful
when doing calculations (and simplifications) with the regression-analysis theory.
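The normal equations are easy to check numerically. The sketch below fits a line to a small made-up data set (an assumption for illustration, not the housing data) and confirms that both sums are zero up to floating-point rounding.

```python
# Made-up illustrative data (NOT the housing data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residuals e_i = y_i - (b0 + b1 * x_i)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sum_resid = sum(residuals)                                   # first normal equation
sum_x_resid = sum(xi * ei for xi, ei in zip(x, residuals))   # second normal equation

print(f"sum of residuals       = {sum_resid:.2e}")
print(f"sum of x_i * residuals = {sum_x_resid:.2e}")
```

Both printed values should be on the order of machine precision rather than exactly 0, because the arithmetic is done in floating point.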
Properties (Mean, Variance, and Distribution) of Least-Squares Estimator of the Slope
We will use $\hat{\beta}_1$ to estimate the slope of the population regression line. Hence, we want to know properties of $\hat{\beta}_1$. Is it an unbiased estimator? What is its variance? What is its distribution? We’ll work through the answers to these questions in class:
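Before the derivation, a small simulation can suggest what the answers should look like. This is only a sketch under assumed values: the true parameters, the x values, and the seed below are all made up. It repeatedly generates data from the simple linear regression model and compares the simulated slopes with the standard results that the estimator is unbiased and has variance $\sigma^2 / \sum (x_i - \bar{x})^2$.

```python
import random

random.seed(445)  # arbitrary seed so the run is repeatable

# Assumed (made-up) true parameters and fixed x values for the simulation.
beta0, beta1, sigma = 1.0, 2.0, 3.0
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)


def slope_estimate(y):
    """Least-squares slope for responses y observed at the fixed x values."""
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx


reps = 20000
slopes = []
for _ in range(reps):
    # Y_i = beta0 + beta1 * x_i + error, with independent N(0, sigma^2) errors.
    y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]
    slopes.append(slope_estimate(y))

mean_slope = sum(slopes) / reps
var_slope = sum((b - mean_slope) ** 2 for b in slopes) / (reps - 1)

print(f"average of simulated slopes:  {mean_slope:.3f} (true slope {beta1})")
print(f"variance of simulated slopes: {var_slope:.3f} (sigma^2 / Sxx = {sigma**2 / s_xx:.3f})")
```

A histogram of `slopes` would also look bell-shaped, which is consistent with the normal distribution derived in class.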
Estimation of the Common Variance
Notice that $\sigma^2$ is involved in the variance of $\hat{\beta}_1$, yet $\sigma^2$ is unknown. We need to find an estimator of $\sigma^2$. The textbook showed that $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$ is the maximum-likelihood estimator of $\sigma^2$. Is this an unbiased estimator?
After some straightforward (but grungy) algebra, we can show $E(\hat{\sigma}^2) = \frac{n-2}{n} \sigma^2$. That is, the MLE is a biased estimator. But it’s easy to create an unbiased estimator of $\sigma^2$:
$\frac{n}{n-2} \hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$ [This is typically denoted $S^2$]
Note that $Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$ is simply the i-th residual. We know the residuals sum to 0, so the sample mean of the residuals is 0. Then $S^2$ is simply the sample variance of the residuals (we divide by (n-2) in the denominator, because two regression parameters are estimated so we “lose” two degrees of freedom). Since the residuals are estimates of the errors, $\varepsilon_i$, it makes sense that the sample variance of the residuals is our estimator of the variance of the errors, $\sigma^2$.
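The bias of the maximum-likelihood estimator can be seen in a quick simulation. This is a sketch under assumed values (the true parameters, the ten x values, and the seed are made up): it averages the sum of squared residuals over many simulated data sets and compares dividing by n with dividing by n - 2.

```python
import random

random.seed(2024)  # arbitrary seed so the run is repeatable

beta0, beta1, sigma = 1.0, 2.0, 2.0    # assumed true values, so sigma^2 = 4
x = [float(i) for i in range(1, 11)]   # n = 10 fixed x values
n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)


def sse(y):
    """Sum of squared residuals from the least-squares line through (x, y)."""
    y_bar = sum(y) / n
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


reps = 20000
total_sse = 0.0
for _ in range(reps):
    y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]
    total_sse += sse(y)

mean_sse = total_sse / reps
mle_avg = mean_sse / n              # long-run average of the MLE, SSE / n
unbiased_avg = mean_sse / (n - 2)   # long-run average of S^2, SSE / (n - 2)

print(f"average MLE: {mle_avg:.2f} (theory: (n-2)/n * sigma^2 = {(n - 2) / n * sigma**2:.2f})")
print(f"average S^2: {unbiased_avg:.2f} (sigma^2 = {sigma**2:.2f})")
```

With n = 10 the MLE averages about 80% of the true variance, while dividing by n - 2 removes the bias. With a large n (such as 115) the two estimators are nearly the same.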
Inference about the Slope of the Population Regression Line
Typically we want to make inference (significance test or confidence interval) on the slope of the population regression line. We can use our estimator, $\hat{\beta}_1$, to do this inference (note: we need the conditions of normality and constant variance in order for the theory to work out). We’ll work through these derivations in class:
Confidence Interval for a Mean Value
Suppose we want to estimate E(Y) for a fixed value of x, and we want to do so with a confidence interval. For our simple linear regression model, $E(Y) = \beta_0 + \beta_1 x$. Then the predicted value, $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x$, is an estimator of $E(Y)$, and it’s an unbiased estimator (assuming the true model is a straight line).
What is the variance of this estimator? And what is its distribution? We’ll work through these details in class (note: we need the conditions of normality and constant variance in order for the theory to work out):
Prediction Interval for a New Value
Now suppose instead of estimating the mean value of Y, we want to predict a new value of Y at a fixed value of x (this will be more difficult than estimating a mean value: now we must include both the variability due to the fact that the least-squares regression line is not exactly equal to the true regression line and the variability of the future response variable, Y, around the sub-population mean). We’ll work through these details in class (note: we need the conditions of normality and constant variance in order for the theory to work out):
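The two sources of variability can be sketched numerically. The code below uses the usual textbook standard-error formulas, stated here without derivation, and a made-up data set; the function name `interval_standard_errors` is hypothetical. Each interval is then the fitted value plus or minus a t critical value (n - 2 degrees of freedom) times the appropriate standard error.

```python
import math


def interval_standard_errors(x, y, x0):
    """Return (fit, s, se_fit, se_pred) at x0 for a least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar

    # s estimates sigma from the residuals, with n - 2 degrees of freedom.
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    fit = b0 + b1 * x0
    # Uncertainty in the estimated line at x0 (used for the mean-value interval).
    se_fit = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
    # Add the variability of one new response around the line (prediction interval).
    se_pred = math.sqrt(s ** 2 + se_fit ** 2)
    return fit, s, se_fit, se_pred


# Made-up illustrative data (NOT the housing data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

fit, s, se_fit, se_pred = interval_standard_errors(x, y, x0=4.0)
print(f"fit = {fit:.2f}, se_fit = {se_fit:.3f}, se_pred = {se_pred:.3f}")
```

The prediction standard error is always larger than the standard error of the fit, since it adds the error variance on top of the uncertainty in the line.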
Regression Diagnostics
We can use the residuals to check our regression model:
· If we specified the model (i.e., a line) correctly and the conditions of the model are correct, then the basic residual plot (residuals versus fitted values) should simply show a random scatter of points; that is, we modeled the data as well as possible and all that’s left over is random variation. An example of a “random scatter” of points is shown below. (Note: Curvature in the residual plot would indicate that a straight-line model is not adequate and that a curved model, e.g., a quadratic, should be considered.)

· The basic residual plot can also show a violation of the constant variance condition. For example, the plot below shows the variability of the residuals increases as x decreases. If this constant-variance condition is violated, the remediation is to transform one or both of the variables (e.g., log and square-root transformations typically work well) and then re-run the regression.

· We can also use the residuals to see if the normality condition is met: create a histogram and normal probability plot of the residuals. If there is a big deviation from normality, then a transformation might be helpful, but first remediate based on the previous two bullet points (linearity and constant variance) before working with the normality condition (the other changes will impact normality).
If the normality or constant-variance
conditions are seriously violated, then we cannot trust any inference of our
model (e.g., significance test on the slope, confidence interval for a mean,
prediction interval for a new value).
Another diagnostic for regression is called the Coefficient of Determination, $R^2$ (in simple linear regression, this is numerically the same as the correlation squared). It can be shown (in fact, you’ll show this on your homework):
$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
In short-hand, SST = SSR + SSE. Conceptually, the left-hand side is the total variability in the response variable. The right-hand side is the variability of the fitted values (that is, the variability “explained” by the regression line) plus the variability of the residuals (that is, the variability “unexplained” by the line).
By definition, $R^2 = \frac{SSR}{SST}$. That is, $R^2$ is the proportion of the variability in the response variable that is explained by the regression line. The higher the $R^2$, the better the model and the more we trust our predictions (in multiple regression, it is better to consider adjusted $R^2$, which takes into consideration the number of predictors).
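A numerical check is no substitute for the homework proof, but it can make the decomposition concrete. The sketch below uses a small made-up data set (not the housing data) to confirm that SST = SSR + SSE and that SSR/SST equals the squared sample correlation.

```python
import math

# Made-up illustrative data (NOT the housing data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variability
ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # explained by the line
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # left in the residuals

r_squared = ssr / sst
r = s_xy / math.sqrt(s_xx * sst)                        # sample correlation

print(f"SST = {sst:.4f}, SSR + SSE = {ssr + sse:.4f}")
print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```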
Example
Data were collected on 115 homes sold in Albuquerque, New Mexico in 1993.

Because a linear relationship between these variables seems reasonable, we can proceed with the regression analysis. The regression output from Minitab is shown below.
Regression Analysis

The regression equation is
selling price (in $100) = - 61.8 + 0.682 square feet of living space

Predictor                        Coef  SE Coef      T      P
Constant                       -61.81    53.22  -1.16  0.248
square feet of living space   0.68246  0.03126  21.83  0.000

S = 162.902   R-Sq = 80.8%   R-Sq(adj) = 80.7%

Analysis of Variance

Source           DF        SS        MS       F      P
Regression        1  12646380  12646380  476.56  0.000
Residual Error  113   2998686     26537
Total           114  15645066

Predicted Values for New Observations

New Obs     Fit  SE Fit            95% CI            95% PI
   2000  1303.1    19.1  (1265.3, 1340.9)   (978.2, 1628.0)
The value of R-squared indicates that 80.8% of the variation in selling price is
explained by its linear relationship with square footage. This is a fairly high
R-squared value. Hence, we can feel pretty good about making predictions based
on this model.
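As a check on how the summary numbers fit together, the sketch below recomputes R-Sq, R-Sq(adj), S, and F from the sums of squares in the ANOVA table above. The variable names are my own; the numbers are copied from the Minitab output.

```python
import math

# Values copied from the Minitab Analysis of Variance table in this example.
n = 115
ss_regression = 12646380
ss_error = 2998686
ss_total = 15645066

r_sq = ss_regression / ss_total              # proportion of variability explained
mse = ss_error / (n - 2)                     # S^2, the unbiased variance estimate
s = math.sqrt(mse)                           # S, the residual standard deviation
r_sq_adj = 1 - mse / (ss_total / (n - 1))    # adjusted R-squared
f_stat = (ss_regression / 1) / mse           # F = MS Regression / MS Error

print(f"R-Sq      = {100 * r_sq:.1f}%")
print(f"R-Sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"S         = {s:.3f}")
print(f"F         = {f_stat:.2f}")
```

These agree with the reported values (80.8%, 80.7%, 162.902, and 476.56). Note also that F is the square of the slope’s t statistic, up to rounding in the printed output.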
Before we do any inference based on the
model, we must check the normality and constant-variance conditions by looking at appropriate
graphs of the residuals:

The normality condition clearly appears
to be met. What about the constant-variance condition? Note the variability in
the residuals seems to “fan out” a bit (more variability in the residuals for
higher-priced houses). This violation isn’t awful, but perhaps a transformation
of the data should at least be considered (we’ll consider this in lab).
Now consider the slope of the regression line. A significance test on the true slope ($H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$) clearly shows evidence (p-value = 0.000) that the population slope is different from 0: assuming the population slope is 0 (that is, that square footage has no linear impact on the selling price of a house), there is essentially no chance of getting our sample slope value or a more extreme slope value (note: that is just the “definition of the p-value in the context of the problem” as we’ve discussed many times). This gives us strong evidence that the square footage of a house has a statistically significant linear impact on the selling price of a house (our p-value is smaller than any typically used significance level).
But is this result practically significant? To answer this question, we can create a 95% confidence interval for the population slope: $0.682 \pm 1.981(0.031) = (0.621, 0.743)$. (I got the t-value, 1.981, from Minitab; you can get an estimate of it, based on 120 df, from Table A.5: 1.98.) Hence, for each additional 100 square feet of living area, we are 95% confident that the selling price increases, on average, by between $6,210 and $7,430 (remember, interpretation of the slope in the context of the problem is part of the “explanation” piece of regression). As always, our confidence is in the method we use (i.e., our method gives correct results 95% of the time). Do you think this is of practical importance?
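The slope calculations can be reproduced from the Minitab output with a few lines of arithmetic. This sketch uses the unrounded coefficient and standard error from the output and the t critical value 1.981 quoted above; the variable names are my own.

```python
# Values copied from the Minitab output for the slope.
coef = 0.68246      # estimated slope, in $100 per square foot
se_coef = 0.03126   # standard error of the slope
t_crit = 1.981      # t critical value for 113 df, as quoted in the notes

t_stat = coef / se_coef              # test statistic for H0: slope = 0
half_width = t_crit * se_coef
lower, upper = coef - half_width, coef + half_width

# Convert to dollars per additional 100 square feet:
# multiply by 100 (price is in $100) and by 100 (square feet).
lower_dollars = lower * 100 * 100
upper_dollars = upper * 100 * 100

print(f"t = {t_stat:.2f}")
print(f"95% CI for slope: ({lower:.4f}, {upper:.4f})")
print(f"per 100 sq ft: ${lower_dollars:,.0f} to ${upper_dollars:,.0f}")
```

With the unrounded inputs the interval is roughly $6,205 to $7,444 per 100 square feet. The endpoints quoted in the text ($6,210 and $7,430) appear to come from rounding the slope and its standard error to three decimals before computing.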
The last bit of output shows both a
confidence interval for the average selling price and a prediction interval for
the selling price of a new house with 2000 square feet (Minitab can easily
create these intervals for any x-value of interest). Note the prediction
interval is substantially wider (which isn’t surprising). Be sure to use the interval (confidence interval for a mean response or
prediction interval for a new value) that best answers your particular research
question.
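Both intervals in the last line of the output can be reproduced from the reported Fit, SE Fit, and S, again using the t critical value 1.981 for 113 df. This is a sketch with my own variable names; the prediction standard error combines S and SE Fit as the square root of the sum of their squares.

```python
import math

# Values copied from the Minitab output for a house with 2000 square feet.
fit = 1303.1     # predicted selling price, in $100
se_fit = 19.1    # standard error of the fitted mean
s = 162.902      # residual standard deviation S
t_crit = 1.981   # t critical value for 113 df, as quoted in the notes

# Confidence interval for the mean selling price at 2000 square feet.
ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)

# Prediction interval: add the variability of a single new house around the mean.
se_pred = math.sqrt(s ** 2 + se_fit ** 2)
pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)

print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI: ({pi[0]:.1f}, {pi[1]:.1f})")
```

The results match Minitab’s (1265.3, 1340.9) and (978.2, 1628.0). The prediction interval is much wider because S (about 163) dominates SE Fit (about 19).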