Elementary Statistics—Sampling, Data
Collection, and Foreshadow of Inference (things to consider)
(Note: This looks strangely like a “lecture,”
something I said I wouldn’t do. But this handout then allows us class time to
discuss these issues based on real-world examples. I will share these examples
with you and we’ll also do an activity to solidify some of these sampling
ideas. This handout is required reading
for the course.)
The population is the entire collection of
individuals about which we want information. The sample is the collection of individuals we actually measure. Eventually
we will use the sample to make generalizations about the population, so it’s very important for the sample to be
representative (of the whole population) and not biased. Much thought, time,
and effort should be put into the sampling and data-collection process. It can be
quite challenging to obtain a sample that represents the population well, yet
this is a vital requirement of inference.
Potential Problems with Sampling/Data Collection
· Voluntary response (a voluntary-response sample is almost always biased
  because people with strong opinions, especially negative opinions, are more
  likely to respond)
· Undercoverage (when some groups in the population are left out of the
  process of choosing a sample; e.g., the population is all US adults and the
  sampling method is an email survey sent to a group of randomly selected email
  addresses; those without an email address cannot participate; they are left
  out of the process of selecting a sample)
· Nonresponse (e.g., person chooses not to answer, person isn’t home)
· Response error (e.g., person lies or remembers incorrectly)
· Wording of the question/Interview process/Ordering of questions (e.g.,
  leading questions, prompting by interviewer, certain order of questions to
  prompt a desired response)
· Processing error (e.g., data entry error, mis-recording on the form)
Important Notes
· Undercoverage occurs when some people are left out of the process of choosing
  a sample (e.g., people without phones are excluded). Nonresponse occurs if
  someone who is meant to be sampled is not contacted (or refuses contact).
  That is, undercoverage is a problem with the process of choosing a sample,
  and nonresponse is a problem with the actual process of data collection.
· Not all problems with sampling will necessarily lead to bias. For example, if
  undercoverage occurs, but the people left out share nothing in common that
  affects the response, there may not be a bias. You need to make the case that
  a sampling problem will lead to bias.
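To see why undercoverage does not automatically mean bias, here is a small simulation sketch. Everything in it is made up for illustration: the 70% coverage rate, the response scale centered at 50, the 10-point gap for the left-out group, and the function name `simulate_undercoverage`. It compares the true population mean with the mean of a random sample drawn only from the people who can be reached.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate_undercoverage(related, n_pop=100_000, n_sample=1_000, seed=1):
    """Return (population mean, sample mean) when only covered people can be sampled.

    related=True : being left out is tied to the response (left-out people score lower).
    related=False: being left out has nothing to do with the response.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(n_pop):
        covered = rng.random() < 0.7      # assumption: 70% of people can be reached
        center = 50.0
        if related and not covered:
            center = 40.0                 # assumption: the left-out group differs by 10 points
        population.append((covered, rng.gauss(center, 10.0)))

    all_responses = [y for _, y in population]
    frame = [y for covered, y in population if covered]  # sampling frame omits the uncovered
    sample = rng.sample(frame, n_sample)                 # random sample from the frame only
    return mean(all_responses), mean(sample)


for related in (False, True):
    pop_mean, samp_mean = simulate_undercoverage(related)
    print(f"related={related}: population mean {pop_mean:.2f}, sample mean {samp_mean:.2f}")
```

When the left-out group does not differ in its responses, the sample mean should land close to the population mean. When it does differ, the sample mean should miss in a consistent direction, and that consistent miss is the bias.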
Simple Random Sample
A simple random sample of size n is a sample chosen such that all
groups of size n have the same chance
of being the selected sample. (This can be done using a random number table or
statistical software.)
· This is the gold standard, but it is sometimes difficult to do in practice.
· This solves the problem of undercoverage, but doesn’t solve the other
  problems (e.g., even if you select a simple random sample, you’re not
  guaranteed to get information from everyone in the sample).
· Some situations call for more complex sampling plans. (For example,
  stratified sampling, similar to a block experimental design, selects separate
  random samples for each stratum.)
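If you want to see what "using statistical software" might look like, here is a minimal sketch in Python using the standard library's `random.sample`, which draws without replacement so that every group of size n is equally likely. The population of 100 ID numbers, the two strata names, and the helper names `simple_random_sample` and `stratified_sample` are all assumptions made up for this illustration.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n individuals without replacement; every group of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a separate simple random sample within each stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(members, n_per_stratum[name])
        for name, members in strata.items()
    }


# Hypothetical population: 100 people labeled 1..100.
ids = list(range(1, 101))
srs = simple_random_sample(ids, 10, seed=42)
print("SRS of size 10:", sorted(srs))

# Hypothetical strata: IDs 1-60 are first-years, IDs 61-100 are seniors.
strata = {"first_year": list(range(1, 61)), "senior": list(range(61, 101))}
strat = stratified_sample(strata, {"first_year": 6, "senior": 4}, seed=42)
print("Stratified sample:", strat)
```

Note that the code only handles the selection step. It does nothing about nonresponse, response error, or question wording.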
Bottom Line
It’s often difficult to obtain a sample that represents a population well. But it’s
important to work hard, thoroughly, and thoughtfully to get representative (if
not random) data. In the end, you might have to work with non-ideal, yet
reasonable data (but mention all the caveats—for example, the limitations in
the scope of your inference).
Moving Toward Inference
A
parameter is a number that describes
the population (e.g., mean, median, proportion). In most (all?) real-world
situations, the parameter is unknown (it’s very time-consuming and expensive to
perform an accurate census of a population, and that process is often filled
with errors). But what if we really, really want to know this parameter (e.g.,
it’s central to a research question)? A statistic
is a number that describes a sample (e.g., sample mean, sample median, sample
proportion). This statistic can be used to estimate the unknown parameter (yay!). But the statistic can change from sample to sample
(clearly there is variability in the different samples). We need to understand
how this statistic changes from sample to sample.
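Here is a quick sketch of the parameter/statistic distinction. The population is invented (10,000 made-up heights in centimeters), so in this toy setting we can compute the parameter directly, something we usually cannot do in practice. The function name `sample_mean` and the sample size of 25 are also just choices for this illustration.

```python
import random

rng = random.Random(7)

# Hypothetical population: 10,000 made-up heights (cm).
population = [rng.gauss(170.0, 10.0) for _ in range(10_000)]
parameter = sum(population) / len(population)  # population mean (normally unknown)


def sample_mean(pop, n, rng):
    """Draw a simple random sample of size n and return its mean (a statistic)."""
    sample = rng.sample(pop, n)
    return sum(sample) / n


# Five different samples of size 25 give five different values of the statistic.
statistics_seen = [sample_mean(population, 25, rng) for _ in range(5)]

print(f"parameter (population mean): {parameter:.2f}")
for i, value in enumerate(statistics_seen, start=1):
    print(f"sample {i}: statistic (sample mean) = {value:.2f}")
```

The parameter is one fixed number. The statistic takes a different value each time a new sample is drawn.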
The
sampling distribution of a statistic
is the distribution of values taken by the statistic in all possible samples of
the same size from the same population. Think carefully about this
definition—it contains much information (students often think they understand
sampling distributions when, in fact, they do not; this is a slippery idea).
It’s most important that you fully understand the idea of a sampling
distribution. Note we do not actually do the repeated sampling—we rely on
theory or simulation for this—but you must understand this concept before you
can understand inference.
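For a population small enough to list, "all possible samples" can be enumerated directly. The sketch below uses a made-up population of five numbers and samples of size 2 (both are assumptions chosen to keep the list short) and builds the exact sampling distribution of the sample mean.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [2, 4, 6, 8, 10]   # hypothetical tiny population
n = 2                           # sample size

# Every possible sample of size n (order ignored, no repeats).
samples = list(combinations(population, n))

# The statistic (sample mean) computed on each possible sample.
means = [Fraction(sum(s), n) for s in samples]

# The sampling distribution: each possible value of the statistic and how often it occurs.
sampling_distribution = Counter(means)

print(f"{len(samples)} possible samples of size {n}")
for value in sorted(sampling_distribution):
    share = Fraction(sampling_distribution[value], len(samples))
    print(f"sample mean {value}: proportion of samples {share}")
```

With a realistic population the number of possible samples is far too large to list, which is why we rely on theory or simulation instead.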
It seems natural to use the sample mean, $\bar{x}$, to estimate the
population mean, $\mu$ (if the sample mean comes from a random, or
very representative, sample). We must know the sampling distribution of $\bar{x}$
before we can make inferences about the
population mean. (We’ll discuss this in much more detail later in the course.)
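As a preview only, here is the standard result we will build toward. It assumes a simple random sample of size $n$ from a much larger population. Writing $\bar{x}$ for the sample mean and $\mu$ for the population mean, and introducing $\sigma$ (not used elsewhere in this handout) for the population standard deviation, the sampling distribution of $\bar{x}$ has mean and standard deviation:

```latex
\mu_{\bar{x}} = \mu, \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
```

In words: the sample mean is centered at the population mean, and its sample-to-sample variability shrinks as the sample size grows.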
Example 1
Consider a hypothetical (infinite) population of coin flips. If the coin is fair, we
know the probability of heads: $p = 0.5$. (Typically, we don’t
know the parameter, but for this example—to illustrate properties of
estimators—we do.) Our statistic is the sample proportion of heads based on 50
flips of this coin. In this case, our sample size is 50 and our population is
the infinite number of coin flips. Suppose we repeat the 50 flips (that is,
repeat the sampling) many, many times, each time determining the sample
proportion. The first graph below shows the sample proportion of heads based on
many samples of size 50 (this is an estimate of the sampling distribution, not
the full sampling distribution). Note
this graph is NOT a distribution of a single sample. It is the distribution of
proportions, each determined from a separate sample of size 50.
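If you would like to reproduce this kind of graph yourself, here is a minimal simulation sketch. The number of repetitions (10,000), the seed, and the function names are my choices for illustration. The result is an estimate of the sampling distribution, shown as a rough text histogram.

```python
import random
from collections import Counter


def sample_proportion(n_flips, rng, p_heads=0.5):
    """Flip a coin n_flips times and return the proportion of heads in that one sample."""
    heads = sum(rng.random() < p_heads for _ in range(n_flips))
    return heads / n_flips


def simulate_proportions(n_flips, n_samples, seed=0):
    """Repeat the sampling n_samples times; return one sample proportion per sample."""
    rng = random.Random(seed)
    return [sample_proportion(n_flips, rng) for _ in range(n_samples)]


proportions = simulate_proportions(n_flips=50, n_samples=10_000)
center = sum(proportions) / len(proportions)
print(f"average of the sample proportions: {center:.3f}")

# Rough text histogram: one row per observed proportion, one '#' per 50 samples.
counts = Counter(proportions)
for value in sorted(counts):
    print(f"{value:.2f} {'#' * (counts[value] // 50)}")
```

Each row of the histogram counts samples, not individual flips.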


· What properties do you notice about the sampling distribution of the sample
  proportion (based on 50 coin flips)?
· The second graph compares two sampling distributions of the proportion: the
  first is based on samples of size 50 and the second is based on samples of
  size 500. What’s the most striking feature of this graph?
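One way to explore the sample-size comparison numerically is the sketch below. It simulates sample proportions of heads for samples of 50 and 500 fair-coin flips and reports the standard deviation of each set. The 2,000 repetitions, the seed, and the function name are assumptions for illustration.

```python
import random
import statistics


def simulate_proportions(n_flips, n_samples, rng, p_heads=0.5):
    """Return n_samples sample proportions of heads, each based on n_flips flips."""
    proportions = []
    for _ in range(n_samples):
        heads = sum(rng.random() < p_heads for _ in range(n_flips))
        proportions.append(heads / n_flips)
    return proportions


rng = random.Random(3)
spread = {}
for n_flips in (50, 500):
    props = simulate_proportions(n_flips, n_samples=2_000, rng=rng)
    spread[n_flips] = statistics.pstdev(props)  # spread of the simulated proportions
    print(
        f"n = {n_flips}: mean of proportions = {statistics.mean(props):.3f}, "
        f"standard deviation = {spread[n_flips]:.4f}"
    )
```

Compare the two standard deviations with what you see in the graph before we discuss it in class.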
Example 2
Recall
the sampling activity we did in lab (three different methods used to estimate
the total area of the 100 “houses”). Included below is a graph—based on data
from previous classes—that compares the distributions of estimates. Which
method produces unbiased estimates?
