Math 445—Significance Testing

 

Like confidence-interval theory, significance testing allows us to make generalizations (inference) about an entire population based on a single random sample. Sometimes, the answer to a research question is not to simply estimate a population value, but to test a specific hypothesis about that value.

 

The statement being tested in a significance test is called the null hypothesis.

·         The null hypothesis is denoted $H_0$.

·         $H_0$ is a statement about a population, expressed in terms of some parameter(s), for example the mean, proportion, or variance.

·         Typically $H_0$ is a statement of “status quo” or “no difference.” Furthermore, in many tests, $H_0$ is an equality statement (e.g., $H_0: \mu = \mu_0$).

·         We assess the strength of the evidence against the null hypothesis. (This is like a proof by contradiction, where the contradiction is not absolute.)

 

The alternative hypothesis is the statement we think is true instead of the null hypothesis.

·         The alternative hypothesis is denoted $H_a$.

·         $H_a$ is typically the research hypothesis, which can be one- or two-sided (only one-sided if there is previous research to indicate a deviation from “status quo” in only one direction).

 

The test statistic measures the compatibility between the data (in our particular random sample) and the null hypothesis. Like in confidence-interval theory, it’s important that the test statistic be a random variable which 1) is a function of both the unknown population parameter and the data, and 2) has a distribution that is completely known—including parameters—when the null hypothesis is true.

 

The p-value is the probability, assuming the null hypothesis is true, of obtaining our particular test statistic or a more extreme test-statistic value (extreme in the direction of the alternative hypothesis). The smaller the p-value, the stronger the evidence against $H_0$. How small does the p-value need to be in order to have enough evidence against $H_0$?

 

There are two types of error that can occur in a significance test:

 

 

                  Null-Hypothesis Status
Conclusion        True              False
Reject            Type I error      Correct
Accept            Correct           Type II error

 

 

A Type I error occurs when we reject $H_0$ when, in fact, it’s actually true. The probability of a Type I error is called the significance level and is denoted $\alpha$. Note the significance level is actually a conditional probability: $\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true})$. We typically choose small values for the significance level (because this is actually under the researcher’s control). The most typical value of $\alpha$ is 0.05. A more stringent value (again, commonly used) is 0.01.

 

If p-value $\le \alpha$, then we have enough evidence to reject $H_0$, and the test results are said to be statistically significant at the $\alpha$ level. (This addresses the question of “how small should the p-value be?” Note, though, the p-value takes on a continuum of values, so no $\alpha$ value should be treated as magical. For example, the p-values of 0.049 and 0.051 are essentially the same. Hence, even though 0.051 > 0.05, we should recognize we have strong evidence against $H_0$ even if there isn’t “statistical significance.”)

 

The p-value approach is more general (and more widely used) than the critical-value approach (shown in the book). Based on the p-value, you can make a decision for any value of $\alpha$. It’s important you understand the p-value approach.

Example

What is the normal body temperature? A 1992 JAMA article suggests the average body temperature might be less than 98.6 degrees Fahrenheit. We can test this theory via a significance test. Suppose the population of body temperatures follows a normal distribution. Let $\mu$ denote the average body temperature for the population, and assume the standard deviation of body temperatures is $\sigma$ = 0.8 degrees Fahrenheit. (Note: This is an unrealistic assumption; if we don’t know the population mean, we won’t know the population standard deviation. We’ll get to the more realistic case of unknown $\sigma$ later.) The researcher is convinced that if the status-quo average of 98.6 degrees is not true, then the real average is lower. (Note: The alternative hypothesis always comes from the researcher. This is an example of a one-sided test. There are examples of two-sided tests in the textbook.)

 

Suppose a random sample of 40 adults is taken (this would be difficult to do) and their body temperatures are measured. From our particular sample, $\bar{x}$ = 98.2 degrees Fahrenheit. Use these data to conduct a significance test.
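One way the test could be worked, as a sketch under the example’s stated assumptions (normal population, known standard deviation 0.8, sample of 40, lower-tailed alternative). Here $Z$ denotes a standard normal random variable, and the p-value is read from a standard normal table:

```latex
\begin{align*}
H_0 &: \mu = 98.6 \qquad \text{versus} \qquad H_a : \mu < 98.6 \\
z &= \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{98.2 - 98.6}{0.8/\sqrt{40}} \approx -3.16 \\
\text{p-value} &= P(Z \le -3.16) \approx 0.0008
\end{align*}
```

If the true average body temperature were 98.6 degrees, a sample average of 98.2 degrees or lower would occur in fewer than 1 in 1,000 samples of size 40. These data therefore give strong evidence that the average body temperature is below 98.6 degrees.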

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Now let’s consider the second type of error in significance testing: A Type II error occurs when we accept $H_0$ when, in fact, it’s actually false. The probability of a Type II error is denoted $\beta$. Note this probability is actually a conditional probability: $\beta = P(\text{do not reject } H_0 \mid H_0 \text{ is false})$. Note that $\alpha$ is controlled (read: chosen) by the researcher, but $\beta$ is not always controlled. Therefore, if p-value > $\alpha$, we typically say “we do not have enough evidence to reject the null hypothesis,” rather than “we accept the null hypothesis as true” (because the probability of a Type II error might be high, so we protect ourselves with this language).

 

The power of a statistical test is the probability we reject $H_0$ when it’s actually false. Note power $= 1 - \beta$. A powerful test is desirable (a common gold standard is power $\ge 0.8$). How can we calculate the power? First we must set a desired level for $\alpha$ and choose a specific value of the parameter in the alternative hypothesis.

 

Example

Think about the court system in the United States (“innocent until proven guilty”). Within this context, define the null and alternative hypothesis, p-value, Type I error, Type II error, and power.


 

Example (Power Calculation)

Reconsider the body temperature example. Suppose we want to test at significance level $\alpha$. Furthermore, we want to determine the power of detecting a true average temperature of 98.4 degrees Fahrenheit (this is the specific value in the alternative hypothesis). Determine the power of the test. (And think about the relationship between the power and 1) the significance level, 2) the sample size, and 3) the specific value in the alternative hypothesis.)
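The power calculation can be sketched in Python using only the standard library. The significance level for this example is not given in the text, so `alpha = 0.05` below is an assumed value; the function name `power_lower_tail_z` is also just an illustrative choice. The idea is to find the cutoff for the sample mean that defines the rejection region under the null hypothesis, then compute the chance of landing in that region when the true mean is 98.4.

```python
from math import sqrt
from statistics import NormalDist


def power_lower_tail_z(mu0, mu_alt, sigma, n, alpha):
    """Power of the lower-tailed z-test (H0: mu = mu0 vs Ha: mu < mu0) when mu = mu_alt."""
    std_normal = NormalDist()
    se = sigma / sqrt(n)  # standard error of the sample mean
    # Reject H0 when the sample mean is at or below this cutoff.
    cutoff = mu0 + std_normal.inv_cdf(alpha) * se
    # Probability the sample mean falls in the rejection region if mu = mu_alt.
    return std_normal.cdf((cutoff - mu_alt) / se)


# alpha = 0.05 is an ASSUMED level; the other values come from the example.
power = power_lower_tail_z(mu0=98.6, mu_alt=98.4, sigma=0.8, n=40, alpha=0.05)
print(f"power = {power:.3f}")
```

Under the assumed 0.05 level the power comes out a little under 0.5, so this test would miss a true mean of 98.4 degrees more often than not. Re-running the function with a larger `alpha`, a larger `n`, or a `mu_alt` farther below 98.6 shows the power increasing in each case.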

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Example (Sample-Size Determination for a Specific Power)

It can be a proactive step to first discuss (with the researcher) the desired “deviation from status quo” to be detected and the desired significance level. Then a sample size can be determined (before any data are collected). Again, reconsider the body temperature example. Suppose the researcher wants to test at significance level $\alpha$. Furthermore, she wants a power of 0.8 of detecting a true mean of 98.3 degrees Fahrenheit (that is, a deviation of 0.3 degrees below the null-hypothesized value). How large a sample should she take?
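A minimal sketch of the sample-size calculation for the one-sided z-test follows. It uses the standard result that the power reaches the target once the sample size is at least ((z_alpha + z_beta) * sigma / delta)^2, where delta is the deviation to detect. The significance level is not stated in the text, so `alpha = 0.05` is an assumed value, and the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_one_sided_z(sigma, delta, alpha, power):
    """Smallest n giving at least `power` for a one-sided z-test to detect a shift of `delta`."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)  # critical value for the test
    z_beta = std_normal.inv_cdf(power)       # quantile matching the desired power
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


# alpha = 0.05 is an ASSUMED level; sigma, delta, and power come from the example.
n_needed = sample_size_one_sided_z(sigma=0.8, delta=0.3, alpha=0.05, power=0.8)
print(n_needed)
```

With the assumed 0.05 level this gives a sample of 44 adults. The result is rounded up because rounding down would leave the power slightly below 0.8.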


 

Significance Test for a Population Mean (Population standard deviation known)

Suppose we have a random sample from a normal population with unknown mean, $\mu$, but known standard deviation, $\sigma$. Furthermore, suppose we want to test the null hypothesis, $H_0: \mu = \mu_0$. Then we first calculate the test statistic, $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$, which tells us the number of standard errors our particular sample average is from the null-hypothesized population mean. Then we can use the standard normal distribution to determine the p-value (recall the p-value depends on the direction of the alternative hypothesis):

·         If $H_a: \mu > \mu_0$, then p-value $= P(Z \ge z)$.

·         If $H_a: \mu < \mu_0$, then p-value $= P(Z \le z)$.

·         If $H_a: \mu \ne \mu_0$, then p-value $= 2P(Z \ge |z|)$.

Finally, we define and interpret the p-value in the context of the problem and provide a conclusion (which might depend on a given significance level, $\alpha$). Note: By the Central Limit Theorem, if n is “large,” we can still use this test, even if our sample isn’t from a normal distribution.
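The procedure can be sketched as a small Python function using only the standard library. The function name `z_test` and the labels "less", "greater", and "two-sided" are illustrative choices, not notation from the textbook. The usage lines rerun the body temperature data under each alternative to show how the direction of the alternative hypothesis changes the p-value.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alternative):
    """One-sample z-test with known sigma; returns (z, p_value)."""
    z = (xbar - mu0) / (sigma / sqrt(n))  # standard errors between xbar and mu0
    cdf = NormalDist().cdf
    if alternative == "less":         # Ha: mu < mu0
        p_value = cdf(z)
    elif alternative == "greater":    # Ha: mu > mu0
        p_value = 1 - cdf(z)
    elif alternative == "two-sided":  # Ha: mu != mu0
        p_value = 2 * (1 - cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    return z, p_value


# Body temperature example: xbar = 98.2, mu0 = 98.6, sigma = 0.8, n = 40.
results = {}
for alt in ("less", "greater", "two-sided"):
    results[alt] = z_test(98.2, 98.6, 0.8, 40, alt)
    z, p = results[alt]
    print(f"{alt:>9}: z = {z:.2f}, p-value = {p:.4f}")
```

The same z statistic gives a very small p-value for the lower-tailed alternative, twice that for the two-sided alternative, and a p-value near 1 for the upper-tailed alternative, because the data fall on the opposite side of 98.6.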

 

Relationship between Confidence Interval and Significance Test:

A level $\alpha$, two-sided significance test rejects the hypothesis $H_0: \mu = \mu_0$ exactly when the value $\mu_0$ falls outside the $100(1-\alpha)\%$ confidence interval for $\mu$. (Put another way, the significance test does not reject the hypothesis if the value $\mu_0$ falls inside the corresponding confidence interval.) This relationship also holds for one-sided tests and one-sided confidence bounds.

 

More Explanation on the Relationship between Confidence Interval and Significance Test

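For a more mathematical explanation of the previous result, consider a two-sided test using significance level $\alpha = 0.05$. Then the “acceptance region” (really the “do-not-reject region”) for the test statistic is between -1.96 and 1.96. Then (via simple algebra),

$$-1.96 < \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} < 1.96 \iff \bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} < \mu_0 < \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}},$$

which says the value of $\mu_0$ is inside the 95% confidence interval.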
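This equivalence can be checked numerically. The sketch below uses the body temperature summary values and a handful of candidate null values; the candidates are hypothetical, chosen only for illustration, as are the function names. For each candidate it compares the two-sided test decision at the 0.05 level with whether the candidate lies outside the 95% confidence interval.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_rejects(xbar, mu0, sigma, n, alpha):
    """True when the two-sided z-test of H0: mu = mu0 rejects at level alpha."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value <= alpha


def z_confidence_interval(xbar, sigma, n, alpha):
    """Two-sided 100(1 - alpha)% confidence interval for mu (sigma known)."""
    margin = NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(n)
    return xbar - margin, xbar + margin


xbar, sigma, n, alpha = 98.2, 0.8, 40, 0.05
low, high = z_confidence_interval(xbar, sigma, n, alpha)
print(f"95% CI: ({low:.3f}, {high:.3f})")

checks = []
for mu0 in (97.9, 98.0, 98.3, 98.4, 98.6):  # hypothetical null values
    rejects = two_sided_z_rejects(xbar, mu0, sigma, n, alpha)
    outside = not (low < mu0 < high)
    checks.append((mu0, rejects, outside))
    print(f"mu0 = {mu0}: test rejects = {rejects}, outside CI = {outside}")
```

For every candidate the two columns agree: the test rejects exactly the null values that fall outside the interval.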

 

 

Practical Significance versus Statistical Significance

It’s possible for test results to be statistically significant, yet not practically significant. For example, you might find a statistically significant difference in means (i.e., you can reject the null hypothesis that the means are the same), yet in the context of the problem the difference might not be practically important. (For example, perhaps you find a significant difference in average decrease in cholesterol for patients taking a drug versus patients taking a placebo, but the magnitude of the difference is only 5 mg/dL. Doctors probably won’t find this practically important, and certainly not important enough to put their patients on that drug.)

 

Hence, if you find statistically significant test results, it’s a good idea to accompany your results with a corresponding confidence interval (to assess the practical significance—but realize it’s an expert in the field, not necessarily a statistician, who should assess the practical importance).

 

 

Test for a Population Proportion

The textbook discusses a large-sample test for a population proportion (based on the standard normal distribution), and a small-sample test (based on the binomial distribution). We do not often use this test in practice, because it’s uncommon to have a situation where there is a precise value of p we want to test. Hence, when doing inference about a single population proportion, we’ll stick with a confidence interval. (That is, you can skip textbook Section 9.3. You can also skip Section 9.4—which discusses more esoteric ideas about hypothesis testing—with the exception of the idea of practical significance, which you must know.)

What if we don’t know the population standard deviation? (This is the much more realistic setting.) Then we can use the sample standard deviation to estimate the population standard deviation, and our test statistic has a t-distribution (not z-distribution).

 

Significance Test for a Population Mean (Population standard deviation unknown)

Suppose we have a random sample from a normal population with unknown mean, $\mu$, and with unknown standard deviation. Before using the t-procedures for inference, we must check the condition that the population values follow a normal distribution. We can estimate the population distribution using an appropriate graph of our sample data values. If the distribution of our sample looks mound-shaped, then the condition has been met, and we can continue with our analysis. If the sample-data distribution deviates slightly from normality, then we can still use the t-procedures (these procedures are “robust” in that the probability calculations required are insensitive to small violations of the required conditions). But if the sample size is small (n < 40) and the sample-data distribution looks very non-normal, then the t-procedures should not be used.

 

Suppose we want to test the null hypothesis, $H_0: \mu = \mu_0$. Then we first calculate the test statistic, $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, which tells us the number of standard errors our particular sample average is from the null-hypothesized population mean. Then we use the t-distribution with $n - 1$ degrees of freedom to determine the approximate p-value (recall the p-value depends on the direction of the alternative hypothesis).

 

Finally, we define and interpret the p-value in the context of the problem and provide a conclusion (which might depend on a given significance level, $\alpha$). And, if the results are statistically significant, we also consider the practical significance.

 

Type II error-rate and power calculations are more difficult when using the t-distribution. We’ll use Minitab to do these calculations (the only power calculations you must know how to “do by hand” are those for the one-sample z-test).

 

Relationship between Confidence Interval and Significance Test:

A level $\alpha$, two-sided significance test rejects the hypothesis $H_0: \mu = \mu_0$ exactly when the value $\mu_0$ falls outside the $100(1-\alpha)\%$ confidence interval for $\mu$. (Put another way, the significance test does not reject the hypothesis if the value $\mu_0$ falls inside the corresponding confidence interval.) This relationship also holds for one-sided tests and one-sided confidence bounds.

 

 

Example

Excerpts from Textbook Problem 9.35:

The industry standard for the amount of alcohol poured into many types of drinks (e.g., gin for a gin-and-tonic, whiskey on the rocks) is 1.5 ounces. A sample of 8 bartenders (with at least five years of experience) was asked to pour rum for a rum and coke into a short, wide glass. The 8 pour amounts (in ounces) are summarized in the graphs and numerical summaries below.

 

 

Variable                N     Mean    StDev   Minimum      Q1   Median      Q3   Maximum

Rum Pours (in oz.)      8   1.8163   0.2105      1.48  1.6775    1.805  1.9775      2.16


 

Check of Conditions

Suppose this sample represents the population of all experienced bartenders. We want to use this sample to test hypotheses about the average amount of rum poured in rum and cokes. The population standard deviation is unknown, so we must estimate it with the sample standard deviation. Hence, we must use a t-test, rather than a z-test. But recall the t-test assumes the population being sampled from is normal. This is an assumption we must check by looking at the distribution of our sample data. With only 8 data points, it’s difficult to check the normality assumption. We can see that the sample data distribution is approximately symmetric, but because there are so few data points, we can’t see where things “pile up” (if the points pile up in the center or not). Based on the normality plot, though, it seems reasonable to assume the sample comes from a normal distribution. 

 

Statement of Hypotheses
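The hypotheses themselves are not shown here. One reading consistent with the Conclusion below (evidence that the average pour “is not 1.5 ounces”) is a two-sided test, where $\mu$ (a symbol assumed here) denotes the average rum pour for all experienced bartenders:

```latex
H_0: \mu = 1.5 \qquad \text{versus} \qquad H_a: \mu \neq 1.5
```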

 

 

 

 

 

Test Statistic Calculation
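Using the numerical summaries above (sample mean 1.8163, sample standard deviation 0.2105, sample size 8), the one-sample t statistic can be worked out as follows. This is a sketch of the arithmetic, assuming the null-hypothesized value is the 1.5-ounce industry standard:

```latex
t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}
  = \frac{1.8163 - 1.5}{0.2105/\sqrt{8}}
  \approx \frac{0.3163}{0.0744}
  \approx 4.25,
\qquad \text{df} = n - 1 = 7
```

The sample average is about 4.25 standard errors above the industry standard.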

 

 

 

 

 

 

P-value Calculation
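The p-value would normally come from statistical software or a t-table. As a rough check, the sketch below uses only the Python standard library: it integrates the t density numerically (Simpson’s rule) to get tail areas. It assumes a two-sided alternative, which is an inference from the Conclusion rather than something stated in the problem, and the helper names `t_pdf` and `t_cdf` are illustrative.

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=20000):
    """P(T <= x), using Simpson's rule on [0, |x|] and symmetry about 0."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area under the density between 0 and |x|
    return 0.5 + area if x >= 0 else 0.5 - area


# Summary statistics from the rum-pour sample; mu0 is the industry standard.
n, xbar, s, mu0 = 8, 1.8163, 0.2105, 1.5
t_stat = (xbar - mu0) / (s / math.sqrt(n))
df = n - 1

# Two-sided p-value (ASSUMED alternative: mu != 1.5).
p_value = 2 * (1 - t_cdf(abs(t_stat), df))
print(f"t = {t_stat:.2f}, df = {df}, two-sided p-value = {p_value:.4f}")
```

The two-sided p-value rounds to 0.004, in line with the value quoted in the Conclusion.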

 

 

 

 

 

 

 

 

 

Conclusion

Assuming the average rum pour (for a rum and coke) for all experienced bartenders is 1.5 ounces, there is only a 0.004 chance of getting our particular sample average pour (1.82 oz) or a more extreme average pour. Because our data are so unlikely, we have strong evidence that the average rum pour is not 1.5 ounces. (These results are statistically significant at all common significance levels.) These data actually point to an average rum pour that is greater than 1.5 ounces.

 

Practical Significance

We know that our sample average of 1.8163 ounces is statistically significantly different from the industry standard of 1.5 ounces (as shown above). But is it practically significant? To better answer this question, we can create a 95% confidence interval for the population mean rum pour (note that $t_{0.025,7} = 2.365$):

$$\bar{x} \pm t_{0.025,7}\,\frac{s}{\sqrt{n}} = 1.8163 \pm 2.365\,\frac{0.2105}{\sqrt{8}} = 1.8163 \pm 0.1760 = (1.64,\ 1.99)$$

We are 95% confident that the average rum pour (for a rum and coke) for all experienced bartenders is between 1.64 ounces and 1.99 ounces. [As a side note, notice that the 95% confidence interval doesn’t contain the null-hypothesized value of 1.5 ounces, which agrees with our previous finding that the results are definitely significant at the 0.05 level.] In the context of the problem, is this range of values practically different from 1.5 ounces? That is, will the bar owners or customers really care? Maybe, but we should ask the experts (e.g., bar owners) to answer this question (e.g., perhaps this means a lot of lost money, practically speaking).