Math 207—Data Collection
Observation versus Experimentation:
· An observational study observes individuals and measures variables of interest, but does not attempt to influence the responses (allows for confounding to occur).
· An experiment deliberately imposes some treatment on individuals in order to observe their responses.
· Unlike an observational study, a well-designed experiment can show a causal link between variables.
Principles of Good Experimental Design:
· Control the effects of confounding variables on the response, most simply by comparing at least two treatments.
· Randomly assign experimental units to treatments (to reduce bias).
· Replicate each treatment on many units to reduce chance variation in the results.
Note: There are many, many different types of experimental designs and analyses of those designs.
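The random-assignment principle above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name randomly_assign, the subject labels, and the two group names are hypothetical, and the sketch covers only the simplest completely randomized design.

```python
import random

def randomly_assign(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatments in turn.

    Group sizes differ by at most one. Chance alone decides which
    unit ends up receiving which treatment.
    """
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical example: 20 subjects split between two treatments.
subjects = [f"subject_{i:02d}" for i in range(1, 21)]
assignment = randomly_assign(subjects, ["treatment", "control"], seed=207)
for name, members in assignment.items():
    print(name, len(members), sorted(members))
```

Comparing two groups reflects the control principle, and placing ten units in each group reflects replication. Fixing the seed only makes the illustration repeatable.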
Sometimes, rather than linking one variable to another (as in an experiment), we simply want to take some measurement on a population. If the population is large, we may need to make inferences about the population from a sample.
The population is the entire collection of individuals about which we want information.
The sample is the collection of individuals we actually measure.
Potential Problems with Sampling/Data Collection:
· Voluntary response (a voluntary response sample is often biased because people with strong opinions, especially negative opinions, are more likely to respond)
· Undercoverage (when some groups in the population are left out of the process of choosing a sample)
· Nonresponse (e.g., person chooses not to answer, person isn’t home)
· Response error (e.g., person lies or remembers incorrectly)
· Wording of questions/Interview process (e.g., leading questions, prompting by interviewer)
· Processing error (e.g., data entry error, misrecording on the form)
Clarifying Notes:
· The first two bullet points above indicate problems with the process of choosing a sample. The last four bullet points indicate problems with the process of data collection.
· Students sometimes confuse the issues of undercoverage and nonresponse. Undercoverage occurs when some people are left out of the process of choosing a sample (e.g., people without phones are excluded). Nonresponse occurs when someone who was chosen for the sample cannot be contacted or chooses not to answer. That is, undercoverage is a problem with the process of choosing a sample, and nonresponse is a problem with the actual process of data collection.
· Not all problems with sampling will necessarily lead to bias. For example, if undercoverage occurs, but the people left out share nothing in common that affects their responses, there may not be a bias. You need to make the case that a sampling problem will lead to bias.
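The last point can be illustrated with a small simulation. This is a sketch with invented numbers, not real data: it assumes a hypothetical population in which 80% of people own a phone, only phone owners can be sampled, and each person answers "yes" or "no" to a single question. The function name coverage_bias and all of the probabilities are assumptions made for the illustration.

```python
import random

def coverage_bias(p_yes_with_phone, p_yes_without_phone,
                  n=100_000, p_phone=0.8, seed=207):
    """Return (proportion of 'yes' among phone owners)
    minus (proportion of 'yes' in the whole population)."""
    rng = random.Random(seed)
    everyone, covered = [], []
    for _ in range(n):
        has_phone = rng.random() < p_phone
        p_yes = p_yes_with_phone if has_phone else p_yes_without_phone
        answer = 1 if rng.random() < p_yes else 0
        everyone.append(answer)
        if has_phone:            # only phone owners can ever be sampled
            covered.append(answer)
    return sum(covered) / len(covered) - sum(everyone) / len(everyone)

# Case 1: the people left out answer just like everyone else.
unrelated = coverage_bias(0.5, 0.5)
# Case 2: the people left out are much less likely to say "yes".
related = coverage_bias(0.6, 0.2)

print(f"bias when phone ownership is unrelated to the answer: {unrelated:+.3f}")
print(f"bias when phone ownership is related to the answer:   {related:+.3f}")
```

In the first case the gap between the covered group and the full population should be close to zero. In the second case it should be clearly positive, because the people who are left out differ in a way that affects their responses.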
A simple random sample of size n is a sample chosen such that all groups of size n have the same chance of being the selected sample.
· This is the gold standard, but it is sometimes difficult to do in practice.
· This solves the problem of undercoverage (provided the sample is drawn from a list of the entire population), but doesn’t solve the other problems (e.g., even if you select a simple random sample, you’re not guaranteed to get information from everyone in the sample).
· Some situations call for more complex sampling plans (e.g., stratified random sampling).
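As a minimal sketch, a simple random sample can be drawn with Python's standard random.sample, which selects n distinct individuals so that every group of size n is equally likely. The roster of 200 students, the two strata, and the sample sizes below are hypothetical, and the stratified example shows only the basic idea of taking a separate simple random sample within each group.

```python
import random

rng = random.Random(207)  # fixed seed so the illustration is repeatable

# Hypothetical population: a roster of 200 students.
population = [f"student_{i:03d}" for i in range(1, 201)]

# Simple random sample of size n = 10 (sampling without replacement).
n = 10
srs = rng.sample(population, n)
print("SRS:", sorted(srs))

# Stratified random sample: split the population into groups (strata),
# then take a separate simple random sample from each stratum.
strata = {
    "first_year": population[:120],   # hypothetical split
    "upper_year": population[120:],
}
stratified = {name: rng.sample(members, 5) for name, members in strata.items()}
for name, chosen in stratified.items():
    print(name, sorted(chosen))
```

Note that the code only chooses who is in the sample. It does nothing about nonresponse, response error, or the other data collection problems listed earlier.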