Math 217—Introduction to Categorical Analysis

 

When analyzing categorical data, we follow the same principles as when analyzing quantitative variables: 1) first examine the data descriptively, using summaries appropriate to the question, and then 2) perform a significance test if one is needed and appropriate. A simple two-way table with the marginal and conditional percentages that suit your question is the best place to start.

 

Suppose the government releases data on patient outcomes in hospitals. You want to compare Hospital A and Hospital B, which both serve your community. In particular, you are interested in the survival of patients after surgery at these two hospitals. The raw counts are shown below.

 

 

                            Hospital
Post-Surgery Status        A         B     Total
Death                     63        16        79
Survival                2037       784      2821
Total                   2100       800      2900

 

From the table, we know we have information on 2900 patients. The Total row and Total column give the marginal distributions of the two variables separately: hospital (2100 patients at A, 800 at B) and post-surgery status (79 deaths, 2821 survivals). The inner cells of the table give the joint distribution of the two variables together.
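As a rough illustration of these terms, the sketch below recomputes the margins and the joint distribution from the four inner cell counts. The variable names (`counts`, `status_totals`, `hospital_totals`, `joint`) are made up for this example and are not part of the course materials.

```python
# Inner cell counts from the two-way table: counts[status][hospital]
counts = {
    "Death":    {"A": 63,   "B": 16},
    "Survival": {"A": 2037, "B": 784},
}
hospitals = ("A", "B")

# Marginal distribution of post-surgery status (the Total column)
status_totals = {status: sum(row.values()) for status, row in counts.items()}

# Marginal distribution of hospital (the Total row)
hospital_totals = {h: sum(counts[s][h] for s in counts) for h in hospitals}

# Grand total: number of patients in the whole table
grand_total = sum(status_totals.values())

# Joint distribution: each inner cell as a proportion of all patients
joint = {
    (status, h): counts[status][h] / grand_total
    for status in counts
    for h in hospitals
}

print("Status totals:  ", status_totals)
print("Hospital totals:", hospital_totals)
print("Grand total:    ", grand_total)
for cell, proportion in joint.items():
    print(cell, round(proportion, 4))
```

The joint proportions sum to 1 because every patient falls in exactly one inner cell.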

 

In this case we’re particularly interested in the column percentages, which are conditional percentages: each count divided by its hospital’s total. That is, what are the death and survival rates at each hospital? (Note that we must think carefully about which numerical summaries best answer the question of interest.)

 

 

                                 Hospital
Post-Surgery Status           A               B
Death                      63 (3%)         16 (2%)
Survival                 2037 (97%)       784 (98%)
Total                    2100             800

(Percentages are column percents, conditional on the hospital.)
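The column percentages in the table can be reproduced with a short calculation: divide each count by the total for its hospital. This is only a sketch, and the helper name `column_percents` is hypothetical.

```python
# Inner cell counts from the two-way table: counts[status][hospital]
counts = {
    "Death":    {"A": 63,   "B": 16},
    "Survival": {"A": 2037, "B": 784},
}

def column_percents(table):
    """Return each count as a percentage of its column (hospital) total."""
    columns = list(next(iter(table.values())).keys())
    totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        status: {c: 100 * row[c] / totals[c] for c in columns}
        for status, row in table.items()
    }

col_pct = column_percents(counts)
for status, row in col_pct.items():
    print(status, {h: round(p, 1) for h, p in row.items()})
```

Within each hospital the percentages add to 100, which is a quick check that we conditioned on the column variable and not on the row variable.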

 

In this case it looks like Hospital B is better for surgery, since its survival rate is higher (98% versus 97%). But can you think of another variable that might affect a patient’s post-surgery status? (There are probably many.) How about the condition of the patient before surgery?

 

Suppose we now have information on each patient’s condition (good or bad) before entering surgery, and we create a separate two-way table for each condition:

 

                         Good-Condition Patients          Bad-Condition Patients
                                Hospital                         Hospital
Post-Surgery Status         A              B                A               B
Death                     6 (1%)        8 (1.3%)         57 (3.8%)        8 (4%)
Survival                594 (99%)     592 (98.7%)      1443 (96.2%)     192 (96%)
Total                   600           600              1500             200

(Percentages are column percents, conditional on the hospital.)

 

Notice that Hospital A actually has a higher survival rate than Hospital B for both sets of patients (99% versus 98.7% for good-condition patients, and 96.2% versus 96% for bad-condition patients). This indicates that Hospital A is the better bet for surgery. When we ignored the confounding variable of patient condition, we thought Hospital B was better; with the more thorough analysis, we see that Hospital A is actually better.
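To check this comparison numerically, the sketch below computes the survival percentage for each hospital within each patient condition. It then collapses the two tables back into one to confirm that they reproduce the original counts. The names `strata`, `survival_percent`, `within`, `combined`, and `overall` are illustrative only.

```python
# Counts from the two stratified tables: strata[condition][hospital][status]
strata = {
    "Good": {
        "A": {"Death": 6, "Survival": 594},
        "B": {"Death": 8, "Survival": 592},
    },
    "Bad": {
        "A": {"Death": 57, "Survival": 1443},
        "B": {"Death": 8, "Survival": 192},
    },
}

def survival_percent(cell):
    """Percent surviving among the patients counted in one column."""
    return 100 * cell["Survival"] / (cell["Death"] + cell["Survival"])

# Survival percentage for each hospital within each condition
within = {
    condition: {h: survival_percent(cell) for h, cell in by_hospital.items()}
    for condition, by_hospital in strata.items()
}

# Adding the two tables together should give back the original table
combined = {
    h: {
        status: sum(strata[c][h][status] for c in strata)
        for status in ("Death", "Survival")
    }
    for h in ("A", "B")
}
overall = {h: survival_percent(cell) for h, cell in combined.items()}

for condition, rates in within.items():
    print(condition, {h: round(r, 1) for h, r in rates.items()})
print("Overall", {h: round(r, 1) for h, r in overall.items()})
```

Hospital A comes out ahead within each condition, while Hospital B comes out ahead once the conditions are combined.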

 

How can Hospital A do better within each group of patients yet do worse overall? This is called Simpson’s Paradox. Hospital A treats far more bad-condition patients (1500 of its 2100 patients, versus 200 of 800 at Hospital B), probably because it’s known as a good hospital, and bad-condition patients are more likely to die in surgery. This makes Hospital A’s overall survival percentage look worse than Hospital B’s. This is a good example of why all relevant variables should be considered in an analysis.
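One way to see the mechanism is to write each hospital's overall death rate as a weighted average of its two within-group death rates, with weights equal to the share of its patients in each group. The sketch below does this with the counts from the tables; the function name `overall_death_rate` and the data layout are assumptions made for the illustration.

```python
# (deaths, patients) for each hospital within each pre-surgery condition
data = {
    "A": {"Good": (6, 600), "Bad": (57, 1500)},
    "B": {"Good": (8, 600), "Bad": (8, 200)},
}

def overall_death_rate(groups):
    """Weighted average of within-group death rates, weighted by group share."""
    total_patients = sum(n for _, n in groups.values())
    rate = 0.0
    for deaths, n in groups.values():
        weight = n / total_patients      # share of this hospital's patients
        rate += weight * (deaths / n)    # within-group death rate
    return rate

# Share of each hospital's patients who arrived in bad condition
bad_share = {
    h: groups["Bad"][1] / sum(n for _, n in groups.values())
    for h, groups in data.items()
}

for h, groups in data.items():
    print(
        f"Hospital {h}: {bad_share[h]:.0%} bad-condition patients, "
        f"overall death rate {overall_death_rate(groups):.1%}"
    )
```

About 71% of Hospital A's patients are in the high-risk group, compared with 25% at Hospital B. Hospital A's overall rate is therefore pulled toward its bad-condition rate, even though its rate is lower than Hospital B's in both groups.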

 

Next, we’ll discuss inference for two-way tables.