Math 217: Introduction to Categorical Analysis
When analyzing categorical data, we follow the same principles as when analyzing
quantitative variables: 1) first describe the data with appropriate summaries,
and then 2) perform a significance test, if needed (and appropriate). A simple
two-way table with the marginal and conditional percentages that fit your
question is the best place to start.
Suppose the government releases data about patient outcomes in hospitals. You
want to compare Hospital A and Hospital B, which both serve your community. In
particular, you are interested in the survival of patients after surgery in
these two hospitals. The raw data are shown below.
| Post-Surgery Status | Hospital A | Hospital B | Total |
|---|---|---|---|
| Death | 63 | 16 | 79 |
| Survival | 2037 | 784 | 2821 |
| Total | 2100 | 800 | 2900 |
From the table, we know we have information on 2900 patients. The Total row and
Total column give us the marginal distributions of the two separate variables
(hospital and post-surgery status). The inner cells of the table give us the
joint distribution of the two variables together.
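As a minimal sketch of these definitions, the following Python snippet computes the marginal and joint distributions from the counts in the table. The dictionary layout and variable names are illustrative assumptions, not part of the original handout.

```python
# Counts from the two-way table above (rows: post-surgery status, columns: hospital).
counts = {
    "Death":    {"A": 63,   "B": 16},
    "Survival": {"A": 2037, "B": 784},
}
hospitals = ["A", "B"]

# Grand total: every patient in the table.
grand_total = sum(counts[s][h] for s in counts for h in hospitals)

# Marginal distribution of status: each row total over the grand total.
status_marginal = {s: sum(counts[s].values()) / grand_total for s in counts}

# Marginal distribution of hospital: each column total over the grand total.
hospital_marginal = {
    h: sum(counts[s][h] for s in counts) / grand_total for h in hospitals
}

# Joint distribution: each inner cell over the grand total.
joint = {(s, h): counts[s][h] / grand_total for s in counts for h in hospitals}

print("Patients:", grand_total)
print("Status marginal:", {s: round(p, 3) for s, p in status_marginal.items()})
print("Hospital marginal:", {h: round(p, 3) for h, p in hospital_marginal.items()})
print("Joint:", {k: round(p, 3) for k, p in joint.items()})
```

Each marginal distribution sums to 1, as does the joint distribution, since all three divide by the same grand total of 2900 patients.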
In this case we’re particularly interested in the column percentages, which are
conditional percentages. That is, what are the death and survival rates at each
of the two hospitals? (Note that we must think carefully about which numerical
summaries best answer the question of interest.)
| Post-Surgery Status | Hospital A | Hospital B |
|---|---|---|
| Death | 63 (3%) | 16 (2%) |
| Survival | 2037 (97%) | 784 (98%) |
| Total | 2100 | 800 |

(Percentages given are column percents, conditional on the hospital.)
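The column percentages can be reproduced with a short calculation: divide each cell by its column (hospital) total. This is only a sketch; the helper name `column_percents` is a hypothetical choice made for illustration.

```python
# Counts from the two-way table (rows: post-surgery status, columns: hospital).
counts = {
    "Death":    {"A": 63,   "B": 16},
    "Survival": {"A": 2037, "B": 784},
}

def column_percents(counts, hospital):
    """Percent of one hospital's patients in each status (conditional on hospital)."""
    column_total = sum(row[hospital] for row in counts.values())
    return {status: 100 * row[hospital] / column_total for status, row in counts.items()}

for hospital in ("A", "B"):
    percents = column_percents(counts, hospital)
    print(hospital, {status: round(p, 1) for status, p in percents.items()})
```

Within each hospital the percentages sum to 100, which is what makes them conditional on the hospital rather than percentages of all 2900 patients.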
In this case it looks like Hospital B is better for surgery (98% survival versus
97% at Hospital A). But can you think of another variable that might affect the
post-surgery status of a patient (there are probably many)? How about the
condition of the patient before surgery?
Suppose we now have information on each patient’s condition (good or bad)
before entering surgery, and we create a separate two-way table for each
condition:
Good-Condition Patients

| Post-Surgery Status | Hospital A | Hospital B |
|---|---|---|
| Death | 6 (1%) | 8 (1.3%) |
| Survival | 594 (99%) | 592 (98.7%) |
| Total | 600 | 600 |

Bad-Condition Patients

| Post-Surgery Status | Hospital A | Hospital B |
|---|---|---|
| Death | 57 (3.8%) | 8 (4%) |
| Survival | 1443 (96.2%) | 192 (96%) |
| Total | 1500 | 200 |

(Percentages given are column percents, conditional on the hospital.)
Notice that Hospital A actually has a higher survival rate than Hospital B for
both sets of patients (99% versus 98.7% for good-condition patients, and 96.2%
versus 96% for bad-condition patients). This indicates that Hospital A is the
better bet for surgery. When we ignored the confounding variable of patient
condition, we thought Hospital B was better; the more thorough analysis points
the other way.
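The within-group comparison can be checked directly from the stratified counts. The sketch below assumes a simple nested-dictionary layout of (deaths, total patients) pairs; the names `strata` and `survival` are illustrative only.

```python
# (deaths, total patients) by pre-surgery condition and hospital,
# taken from the two stratified tables.
strata = {
    "Good": {"A": (6, 600),   "B": (8, 600)},
    "Bad":  {"A": (57, 1500), "B": (8, 200)},
}

# Survival percentage within each condition group, for each hospital.
survival = {
    condition: {h: 100 * (n - deaths) / n for h, (deaths, n) in by_hospital.items()}
    for condition, by_hospital in strata.items()
}

for condition, rates in survival.items():
    better = max(rates, key=rates.get)  # hospital with the higher survival rate
    summary = ", ".join(f"{h}: {r:.1f}%" for h, r in rates.items())
    print(f"{condition}-condition patients: {summary} (higher: Hospital {better})")
```

The strata also add back up to the original table: 6 + 57 = 63 deaths among 2100 patients at Hospital A, and 8 + 8 = 16 deaths among 800 patients at Hospital B.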
How can Hospital A do better within each group of patients yet do worse
overall? This is called Simpson’s Paradox. Hospital A attracts far more
bad-condition patients (1500 of its 2100 patients, versus 200 of 800 at
Hospital B), probably because it’s known as a good hospital, and bad-condition
patients are more likely to die in surgery. This makes Hospital A’s overall
percentage look worse than Hospital B’s. This is a good example of why all
relevant variables should be considered in the analysis.
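One way to see the mechanism is to write each hospital's overall death rate as a weighted average of its within-group death rates, with weights equal to the share of that hospital's patients in each condition group. The following is a sketch under that framing; the function name `overall_death_rate` is hypothetical.

```python
# (deaths, total patients) by pre-surgery condition and hospital.
strata = {
    "Good": {"A": (6, 600),   "B": (8, 600)},
    "Bad":  {"A": (57, 1500), "B": (8, 200)},
}

def overall_death_rate(hospital):
    """Overall death rate as a weighted average of the within-group death rates."""
    hospital_total = sum(strata[c][hospital][1] for c in strata)
    rate = 0.0
    for condition in strata:
        deaths, n = strata[condition][hospital]
        weight = n / hospital_total   # share of this hospital's patients in the group
        rate += weight * (deaths / n)  # group death rate, weighted by that share
    return rate

for hospital in ("A", "B"):
    total = sum(strata[c][hospital][1] for c in strata)
    bad_share = strata["Bad"][hospital][1] / total
    print(
        f"Hospital {hospital}: {bad_share:.1%} bad-condition patients, "
        f"overall death rate {overall_death_rate(hospital):.1%}"
    )
```

Hospital A puts most of its weight on the higher-risk group, so its overall death rate is pulled toward 3.8%, while Hospital B's is pulled toward 1.3%. That difference in weights, not worse care, is what reverses the overall comparison.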
Next, we’ll discuss inference for two-way tables.