Math 445—Significance Testing
Like
confidence-interval theory, significance testing allows us to make
generalizations (inference) about an entire population based on a single random
sample. Sometimes, the answer to a research question is not to simply estimate
a population value, but to test a specific hypothesis about that value.
The statement being tested in a significance test is called the null hypothesis.
· The null hypothesis is denoted $H_0$.
· $H_0$ is a statement about a population, expressed in terms of some parameter(s): for example, the mean, proportion, or variance.
· Typically $H_0$ is a statement of “status quo” or “no difference.” Furthermore, in many tests, $H_0$ is an equality statement (e.g., $H_0: \mu = \mu_0$).
·
We
assess the strength of the evidence against
the null hypothesis. (This is like a proof by contradiction, where the
contradiction is not absolute.)
The alternative hypothesis is the statement
we think is true instead of the null hypothesis.
· The alternative hypothesis is denoted $H_a$.
· $H_a$ is typically the research hypothesis, which can be one- or two-sided (only one-sided if there is previous research to indicate a deviation from “status quo” in only one direction).
The test statistic measures the compatibility between the data (in our particular random sample) and the null hypothesis. As in confidence-interval theory, it’s important that the test statistic be a random variable which 1) is a function of both the unknown population parameter and the data, and 2) has a distribution that is completely known (including parameters) when the null hypothesis is true.
The p-value is the probability, assuming the null hypothesis is true, of obtaining our particular test statistic or a more extreme test-statistic value (extreme in the direction of the alternative hypothesis). The smaller the p-value, the stronger the evidence against $H_0$. How small does the p-value need to be in order to have enough evidence against $H_0$?
There are two types of error that can occur in a significance test:
| Conclusion   | Null hypothesis true | Null hypothesis false |
| Reject $H_0$ | Type I error         | Correct               |
| Accept $H_0$ | Correct              | Type II error         |
A Type I error is when we reject $H_0$ when, in fact, it’s actually true. The probability of a Type I error is called the significance level and it’s denoted $\alpha$. Note the significance level is actually a conditional probability: $\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true})$. We typically choose small values for the significance level (because this is actually under the researcher’s control). The most typical value of $\alpha$ is 0.05. A more stringent value (again, commonly used) is 0.01.
If p-value $\le \alpha$, then we have enough evidence to reject $H_0$, and the test results are said to be statistically significant at the $\alpha$ level. (This addresses the question of “how small should the p-value be?” Note, though, the p-value takes on a continuum of values, so no $\alpha$ value should be treated as magical. For example, the p-values of 0.049 and 0.051 are essentially the same. Hence, even though 0.051 > 0.05, we should recognize we have strong evidence against $H_0$ even if there isn’t “statistical significance.”)
The p-value approach is more general (and more widely used) than the critical-value approach (shown in the book). Based on the p-value, you can make a decision for any value of $\alpha$. It’s important you understand the p-value approach.
Example
What
is the normal body temperature? A 1992 JAMA article suggests the average body
temperature might be less than 98.6 degrees Fahrenheit. We can test this theory
via a significance test. Suppose the population of body temperatures follows a
normal distribution. Let $\mu$ denote the average body temperature for the
population, and assume the standard deviation of body temperatures is
$\sigma$ = 0.8 degrees Fahrenheit. (Note: This is an
unrealistic assumption: if we don’t know the population mean, we won’t know the
population standard deviation. We’ll get to the more realistic case of unknown
$\sigma$ later.) The researcher is convinced that if
the status-quo average of 98.6 degrees is not true, then the real average is
lower. (Note: The alternative
hypothesis always comes from the researcher. This is an example of a one-sided
test. There are examples of two-sided tests in the textbook.)
Suppose a random sample of 40 adults is taken (this would be difficult to do) and their body temperatures are measured. From our particular sample, $\bar{x}$ = 98.2 degrees Fahrenheit. Use these data to conduct a significance test.
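One way this test might be worked is sketched below. It assumes the hypotheses $H_0: \mu = 98.6$ versus $H_a: \mu < 98.6$ (the one-sided alternative described in the example) and the one-sample z statistic with known $\sigma$; the symbols $z$ and $Z$ (a standard normal random variable) are notation introduced here for illustration.

```latex
\begin{align*}
H_0 &: \mu = 98.6 \qquad \text{versus} \qquad H_a : \mu < 98.6 \\[4pt]
z &= \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}
   = \frac{98.2 - 98.6}{0.8/\sqrt{40}}
   \approx \frac{-0.4}{0.1265}
   \approx -3.16 \\[4pt]
\text{p-value} &= P(Z \le -3.16) \approx 0.0008
\end{align*}
```

Under these assumptions the p-value is well below both 0.05 and 0.01, so the sample gives strong evidence that the average body temperature is lower than 98.6 degrees Fahrenheit.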
Now let’s consider the second type of error in significance testing: A Type II error is when we accept $H_0$ when, in fact, it’s actually false. The probability of a Type II error is denoted $\beta$. Note this probability is actually a conditional probability: $\beta = P(\text{accept } H_0 \mid H_0 \text{ is false})$. Note that $\alpha$ is controlled (read: chosen) by the researcher, but $\beta$ is not always controlled. Therefore, if p-value > $\alpha$, we typically say “we do not have enough evidence to reject the null hypothesis,” rather than “we accept the null hypothesis as true” (because the probability of a Type II error might be high, so we protect ourselves with this language).
The power of a statistical test is the probability we reject $H_0$ when it’s actually false. Note power $= 1 - \beta$. A powerful test is desirable (gold standard: power $\ge 0.8$). How can we calculate the power? First we must set a desired level for $\alpha$ and choose a specific value of the parameter in the alternative hypothesis.
Example
Think
about the court system in the United States (“innocent until proven guilty”).
Within this context, define the null and alternative hypothesis, p-value, Type
I error, Type II error, and power.
Example (Power Calculation)
Reconsider the body temperature example. Suppose we want to test at a given significance level $\alpha$. Furthermore, we want to determine the power of detecting a true average temperature of 98.4 degrees Fahrenheit (this is the specific value in the alternative hypothesis). Determine the power of the test. (And think about the relationship between the power and 1) the significance level, 2) the sample size, and 3) the specific value in the alternative hypothesis.)
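A minimal sketch of this power calculation follows. The significance level did not survive in the text above, so $\alpha = 0.05$ is an assumption here, as is the function name `power_lower_tail_z`; the test is taken to be the one-sided (lower-tail) z-test from the body temperature example with $\sigma = 0.8$ and $n = 40$.

```python
from math import sqrt
from statistics import NormalDist


def power_lower_tail_z(mu0, mu_alt, sigma, n, alpha):
    """Power of the lower-tail z-test of H0: mu = mu0 when the true mean is mu_alt."""
    se = sigma / sqrt(n)                      # standard error of the sample mean
    z_crit = NormalDist().inv_cdf(alpha)      # reject H0 when z <= z_crit (negative)
    xbar_crit = mu0 + z_crit * se             # same cutoff on the sample-mean scale
    # Power = P(sample mean <= cutoff) when the true mean is mu_alt
    return NormalDist(mu_alt, se).cdf(xbar_crit)


# Assumed alpha = 0.05; other values come from the body temperature example.
power = power_lower_tail_z(mu0=98.6, mu_alt=98.4, sigma=0.8, n=40, alpha=0.05)
print(f"power = {power:.3f}")
```

With these assumed inputs the power comes out a little under one half. Rerunning the function with a larger `alpha`, a larger `n`, or a `mu_alt` farther below 98.6 increases the power in each case.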
Example (Sample-Size Determination for a Specific Power)
It can be a proactive step to first discuss (with the researcher) the desired “deviation from status quo” to be detected and the desired significance level. Then a sample size can be determined (before any data are collected). Again, reconsider the body temperature example. Suppose the researcher wants to test at a given significance level $\alpha$. Furthermore, she wants a power of 0.8 of detecting a true mean of 98.3 degrees Fahrenheit (that is, a deviation of 0.3 degrees below the null-hypothesized value). How large a sample should she take?
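The sample-size question can be sketched with the usual formula for a one-sided z-test, $n = \left((z_\alpha + z_\beta)\,\sigma/\delta\right)^2$, rounded up. The significance level is not preserved in the text, so $\alpha = 0.05$ is an assumption; the function name `sample_size_one_sided_z` is also just illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_one_sided_z(delta, sigma, alpha, power):
    """Smallest n giving at least the requested power for a one-sided z-test.

    delta is the deviation from the null-hypothesized mean to be detected.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)   # critical value for the significance level
    z_beta = std_normal.inv_cdf(power)        # quantile matching the desired power
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


# Assumed alpha = 0.05; sigma, delta, and power come from the example.
n_required = sample_size_one_sided_z(delta=0.3, sigma=0.8, alpha=0.05, power=0.8)
print(n_required)
```

Rounding up (rather than to the nearest integer) keeps the achieved power at or above the target.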
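Significance Test for a Population Mean (Population standard deviation known)
Suppose we have a random sample from a normal population with unknown mean, $\mu$, but known standard deviation, $\sigma$. Furthermore, suppose we want to test the null hypothesis, $H_0: \mu = \mu_0$. Then we first calculate the test statistic, $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$, which tells us the number of standard errors our particular sample average is from the null-hypothesized population mean. Then we can use the standard normal distribution to determine the p-value (recall the p-value depends on the direction of the alternative hypothesis):
· If the alternative is $\mu > \mu_0$, the p-value is $P(Z \ge z)$.
· If the alternative is $\mu < \mu_0$, the p-value is $P(Z \le z)$.
· If the alternative is $\mu \ne \mu_0$, the p-value is $2P(Z \ge |z|)$.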
Finally, we define and interpret the p-value in the context of the problem and provide a conclusion (which might depend on a given significance level, $\alpha$). Note: By the Central Limit Theorem, if n is “large,” we can still use this test, even if our sample isn’t from a normal distribution.
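The procedure above can be sketched in a few lines of Python using only the standard library. The function name `z_test` and the `alternative` labels ("less", "greater", "two-sided") are illustrative choices, not notation from the textbook.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alternative):
    """One-sample z-test with known sigma; returns (z, p_value)."""
    z = (xbar - mu0) / (sigma / sqrt(n))      # standard errors between xbar and mu0
    std_normal = NormalDist()
    if alternative == "less":                 # Ha: mu < mu0
        p_value = std_normal.cdf(z)
    elif alternative == "greater":            # Ha: mu > mu0
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "two-sided":          # Ha: mu != mu0
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    return z, p_value


# Body temperature example: H0: mu = 98.6 versus Ha: mu < 98.6
z, p_value = z_test(xbar=98.2, mu0=98.6, sigma=0.8, n=40, alternative="less")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

Returning the p-value rather than a reject/do-not-reject decision matches the p-value approach: the reader can compare it against any significance level.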
Relationship between Confidence Interval and Significance Test:
A level $\alpha$, two-sided significance test rejects the hypothesis $H_0: \mu = \mu_0$ exactly when the value $\mu_0$ falls outside the $100(1-\alpha)\%$ confidence interval for $\mu$. (Put another way, the significance test does not reject the hypothesis if the value $\mu_0$ falls inside the corresponding confidence interval.) This relationship also holds for one-sided tests and one-sided confidence bounds.
More Explanation on the Relationship between Confidence Interval and Significance Test
For a more mathematical explanation of the previous result, consider a two-sided test using significance level $\alpha = 0.05$. Then the “acceptance region” (really the “do-not-reject region”) for the test statistic is between -1.96 and 1.96. Then (via simple algebra),
$$-1.96 < \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} < 1.96 \quad\Longleftrightarrow\quad \bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} < \mu_0 < \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}},$$
which says the value of $\mu_0$ is inside the 95% confidence interval.
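The equivalence can also be checked numerically. The sketch below reuses the body temperature summary ($\bar{x} = 98.2$, $\sigma = 0.8$, $n = 40$) and tries several null-hypothesized means; those candidate values of $\mu_0$ and the function name `two_sided_test_vs_interval` are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_test_vs_interval(xbar, mu0, sigma, n, alpha=0.05):
    """Return (test rejects H0, mu0 lies outside the CI, the CI itself)."""
    se = sigma / sqrt(n)
    z_star = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    z = (xbar - mu0) / se
    rejects = abs(z) > z_star                      # two-sided test decision
    lower, upper = xbar - z_star * se, xbar + z_star * se
    outside = mu0 < lower or mu0 > upper           # is mu0 outside the interval?
    return rejects, outside, (lower, upper)


# Hypothetical null values; the sample summary is from the body temperature example.
results = {mu0: two_sided_test_vs_interval(98.2, mu0, 0.8, 40)
           for mu0 in (98.0, 98.3, 98.4, 98.6)}
for mu0, (rejects, outside, interval) in results.items():
    print(mu0, rejects, outside)
```

For every candidate value the two columns agree: the test rejects exactly when the null-hypothesized mean falls outside the interval.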
Practical Significance versus Statistical Significance
It’s possible for test results to be statistically significant, yet not practically significant. For example, you might find a statistically significant difference in means (i.e., you can reject the null hypothesis that the means are the same), yet in the context of the problem the difference might not be practically important. (For example, perhaps you find a significant difference in average decrease in cholesterol for patients taking a drug versus patients taking a placebo, but the magnitude of the difference is only 5 mg/dL. Doctors probably won’t find this practically important, certainly not important enough to put their patients on that drug.)
Hence, if you find statistically
significant test results, it’s a good idea to accompany your results with a
corresponding confidence interval (to assess the practical significance—but
realize it’s an expert in the field, not necessarily a statistician, who should
assess the practical importance).
Test for a Population Proportion
The
textbook discusses a large-sample test for a population proportion (based on
the standard normal distribution), and a small-sample test (based on the binomial
distribution). We do not often use this test in practice, because it’s uncommon
to have a situation where there is a precise value of p we want to test. Hence, when doing inference about a single
population proportion, we’ll stick with a confidence
interval. (That is, you can skip textbook Section 9.3. You can also skip
Section 9.4—which discusses more esoteric ideas about hypothesis testing—with the
exception of the idea of practical significance, which you must know.)
What if we don’t know the population standard deviation? (This is the much more realistic setting.) Then we can use the sample standard deviation to estimate the population standard deviation, and our test statistic has a t-distribution (not z-distribution).
Significance Test for a Population Mean
(Population standard deviation unknown)
Suppose we have a random sample from a normal population with unknown mean, $\mu$, and with unknown standard deviation. Before using the t-procedures for inference, we must check the condition that the
population values follow a normal distribution. We can estimate the
population distribution using an appropriate graph of our sample data values.
If the distribution of our sample looks mound-shaped, then the condition has
been met, and we can continue with our analysis. If the sample-data
distribution deviates slightly from normality, then we can still use the t-procedures
(these procedures are “robust” in that the probability calculations required
are insensitive to small violations of the required conditions). But if the
sample size is small (n < 40) and the sample-data distribution looks very
non-normal, then the t-procedures should not be used.
Suppose we want to test the null hypothesis, $H_0: \mu = \mu_0$. Then we first calculate the test statistic, $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, which tells us the number of standard errors our particular sample average is from the null-hypothesized population mean. Then we use the t-distribution with $n - 1$ degrees of freedom to determine the approximate p-value (recall the p-value depends on the direction of the alternative hypothesis).
Finally, we define and interpret the p-value in the context of the problem and provide a conclusion (which might depend on a given significance level, $\alpha$). And, if the results are statistically significant, we also consider the practical significance.
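As a rough sketch of the t-procedure, the code below computes the t statistic and a p-value using only the Python standard library. The tail probability is approximated by integrating the t density with Simpson's rule, which is a stand-in chosen here for illustration; in practice a statistics package would normally be used. The names `t_pdf`, `t_upper_tail`, and `t_test` are hypothetical. The usage line plugs in the summary statistics from the rum-pour example in these notes ($n = 8$, $\bar{x} = 1.8163$, $s = 0.2105$, $\mu_0 = 1.5$) with a two-sided alternative.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3                # half the area lies above zero


def t_test(xbar, mu0, s, n, alternative):
    """One-sample t-test; returns (t, degrees of freedom, p_value)."""
    t = (xbar - mu0) / (s / sqrt(n))
    df = n - 1
    upper = t_upper_tail(abs(t), df)          # P(T >= |t|)
    if alternative == "two-sided":
        p_value = 2 * upper
    elif alternative == "greater":
        p_value = upper if t >= 0 else 1 - upper
    elif alternative == "less":
        p_value = upper if t <= 0 else 1 - upper
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    return t, df, p_value


t_stat, df, p_value = t_test(xbar=1.8163, mu0=1.5, s=0.2105, n=8, alternative="two-sided")
print(f"t = {t_stat:.2f}, df = {df}, p-value = {p_value:.4f}")
```

The resulting p-value is close to the 0.004 quoted in the conclusion of that example.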
Type
II error-rate and power calculations are more difficult when using the
t-distribution. We’ll use Minitab to do these calculations (the only power
calculations you must know how to “do by hand” are those for the one-sample
z-test).
Relationship between Confidence Interval and Significance Test:
A level $\alpha$, two-sided significance test rejects the hypothesis $H_0: \mu = \mu_0$ exactly when the value $\mu_0$ falls outside the $100(1-\alpha)\%$ confidence interval for $\mu$. (Put another way, the significance test does not reject the hypothesis if the value $\mu_0$ falls inside the corresponding confidence interval.) This relationship also holds for one-sided tests and one-sided confidence bounds.
Example
Excerpts from Textbook Problem 9.35:
The
industry standard for the amount of alcohol poured into many types of drinks (e.g., gin for a gin-and-tonic, whiskey
on the rocks) is 1.5 ounces. A sample of 8 bartenders (with at least five years
of experience) was asked to pour rum for a rum and coke into a short, wide
glass. The 8 pour amounts (in ounces) are summarized in the graphs and
numerical summaries below.


Variable             N   Mean     StDev    Minimum   Q1       Median   Q3       Maximum
Rum Pours (in oz.)   8   1.8163   0.2105   1.48      1.6775   1.805    1.9775   2.16
Check of Conditions
Suppose
this sample represents the population of all experienced bartenders. We want to
use this sample to test hypotheses about the average amount of rum poured in
rum and cokes. The population standard deviation is unknown, so we must
estimate it with the sample standard deviation. Hence, we must use a t-test,
rather than a z-test. But recall the t-test assumes the population being
sampled from is normal. This is an assumption we must check by looking at the
distribution of our sample data. With only 8 data points, it’s difficult to
check the normality assumption. We can see that the sample data distribution is
approximately symmetric, but because there are so few data points, we can’t see
where things “pile up” (if the points pile up in the center or not). Based on
the normality plot, though, it seems reasonable to assume the sample comes from
a normal distribution.
Statement of Hypotheses
Test Statistic Calculation
P-value Calculation
Conclusion
Assuming the average rum pour (for a rum and coke) for all experienced bartenders is 1.5 ounces, there is only a 0.004 chance of getting our particular sample average pour (1.82 oz) or a more extreme average pour.
Because our data are so unlikely, we have strong evidence that the average rum
pour is not 1.5 ounces. (These results are statistically significant at all
common significance levels.) These data actually point to an average rum pour
that is greater than 1.5 ounces.
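Practical Significance
We know that our sample average of 1.8163 ounces is statistically significantly different from the industry standard of 1.5 ounces (as shown above). But is it practically significant? To better answer this question, we can create a 95% confidence interval for the population mean rum pour (note that the critical value is $t^* = 2.365$ for 7 degrees of freedom):
$$\bar{x} \pm t^*\,\frac{s}{\sqrt{n}} = 1.8163 \pm 2.365\left(\frac{0.2105}{\sqrt{8}}\right) \approx 1.8163 \pm 0.1760 = (1.64,\ 1.99)$$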
We are 95% confident that the average rum pour (for a rum and coke) for all experienced
bartenders is between 1.64 ounces and 1.99 ounces. [As a side note, notice that
the 95% confidence interval doesn’t contain the null-hypothesized value of 1.5
ounces, which agrees with our previous results—that the results are definitely
significant at the 0.05 level.] In the context of the problem, is this range of
values practically different from 1.5 ounces? That is, will the bar owners or
customers really care? Maybe, but we should ask the experts (e.g., bar owners)
to answer this question (e.g., perhaps this means a lot of lost money,
practically speaking).