Elementary Statistics—Sampling, Data Collection, and Foreshadow of Inference (things to consider)

(Note: This looks strangely like a “lecture,” something I said I wouldn’t do. But this handout then allows us class time to discuss these issues based on real-world examples. I will share these examples with you and we’ll also do an activity to solidify some of these sampling ideas. This handout is required reading for the course.)

 

The population is the entire collection of individuals about which we want information. The sample is the collection of individuals we actually measure. Eventually we will use the sample to make generalizations about the population, so it’s very important for the sample to be representative (of the whole population) and not biased. Much thought, time, and effort should be put into the sampling and data-collection process. It can be quite challenging to obtain a sample that represents the population well, yet this is a vital requirement of inference.

 

 

Potential Problems with Sampling/Data Collection

·         Voluntary response (a voluntary-response sample is almost always biased because people with strong opinions, especially negative opinions, are more likely to respond)

 

·         Undercoverage (when some groups in the population are left out of the process of choosing a sample—e.g., the population is all US adults and the sampling method is an email survey sent to a group of randomly selected email addresses; those without an email address cannot participate; they are left out of the process of selecting a sample)

 

·         Nonresponse (e.g., person chooses not to answer, person isn’t home)

 

·         Response error (e.g., person lies or remembers incorrectly)

 

·         Wording of the question/Interview process/Ordering of questions (e.g., leading questions, prompting by interviewer, certain order of questions to prompt a desired response)

 

·         Processing error (e.g., data entry error, mis-recording on the form)
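
To make the voluntary-response problem concrete, here is a small simulation sketch. Everything in it is made up for illustration: the population (30% dissatisfied), the response rates (40% for dissatisfied people, 5% for satisfied people), and the function name simulate_voluntary_response are all assumptions, not data from a real survey.

```python
import random


def simulate_voluntary_response(seed=1, population_size=10_000):
    """Compare a voluntary-response sample with a simple random sample."""
    rng = random.Random(seed)

    # Hypothetical population: True means "dissatisfied" (about 30% of people).
    population = [rng.random() < 0.30 for _ in range(population_size)]
    true_share = sum(population) / population_size

    # Assumed response rates: dissatisfied people reply 40% of the time,
    # satisfied people only 5% of the time.
    respondents = [
        dissatisfied
        for dissatisfied in population
        if rng.random() < (0.40 if dissatisfied else 0.05)
    ]
    voluntary_share = sum(respondents) / len(respondents)

    # A simple random sample of the same size, for comparison.
    srs = rng.sample(population, len(respondents))
    srs_share = sum(srs) / len(srs)

    return true_share, voluntary_share, srs_share


true_share, voluntary_share, srs_share = simulate_voluntary_response()
print(f"population share dissatisfied: {true_share:.3f}")
print(f"voluntary-response estimate:   {voluntary_share:.3f}")
print(f"simple random sample estimate: {srs_share:.3f}")
```

Under these assumed response rates, the voluntary-response estimate lands far above the population share, while the simple random sample of the same size stays close to it.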

 

 

Important Notes

·         Undercoverage occurs when some people are left out of the process of choosing a sample (e.g., people without phones are excluded from a phone survey). Nonresponse occurs when someone who was selected for the sample cannot be contacted or refuses to participate. That is, undercoverage is a problem with the process of choosing a sample, and nonresponse is a problem with the actual process of data collection.

 

·         Not all problems with sampling will necessarily lead to bias. For example, if undercoverage occurs, but the people left out share nothing in common that affects the response, there may not be a bias. You need to make the case that a sampling problem will lead to bias.
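
The point that undercoverage may or may not cause bias can be checked with a quick simulation. This is only a sketch under invented assumptions: 20% of a hypothetical population has no phone, the response variable is roughly normal with mean 50, and (in the "related" case) people without phones score 15 points lower. The function name coverage_gap is hypothetical.

```python
import random
from statistics import mean


def coverage_gap(related, seed=2, population_size=20_000):
    """Return (population mean, mean of the covered group) for a made-up population."""
    rng = random.Random(seed)
    population = []
    for _ in range(population_size):
        has_phone = rng.random() < 0.80      # assumed: 20% cannot be reached
        response = rng.gauss(50, 10)         # assumed response variable
        if related and not has_phone:
            response -= 15                   # left-out group differs systematically
        population.append((has_phone, response))

    overall = mean(r for _, r in population)
    covered = mean(r for has_phone, r in population if has_phone)
    return overall, covered


for related in (False, True):
    overall, covered = coverage_gap(related)
    print(f"related={related}: population mean {overall:.2f}, covered-group mean {covered:.2f}")
```

When being left out has nothing to do with the response, the covered group looks like the whole population. When the two are related, the covered group's mean is pulled away from the population mean, and that gap is the bias you would need to argue for.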

 

 

Simple Random Sample

A simple random sample of size n is a sample chosen such that all groups of size n have the same chance of being the selected sample. (This can be done using a random number table or statistical software.)

·         This is the gold standard, but it is sometimes difficult to do in practice.

 

·         When the sample is drawn from a complete list of the population, this solves the problem of undercoverage, but it doesn’t solve the other problems (e.g., even if you select a simple random sample, you’re not guaranteed to get information from everyone in the sample).

 

·         Some situations call for more complex sampling plans. (For example, stratified sampling, similar to a block experimental design, selects separate random samples for each stratum.)
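
As one way to see the difference between the two plans just described, the sketch below draws a simple random sample and a stratified sample with Python's standard random module. The sampling frame (60 first-years and 40 seniors) and the 6/4 allocation are invented for illustration; proportional allocation is just one common choice.

```python
import random

rng = random.Random(3)

# Hypothetical sampling frame: 100 students, 60 first-years and 40 seniors.
frame = [("first-year", i) for i in range(60)] + [("senior", i) for i in range(40)]

# Simple random sample of size 10: every group of 10 students is equally likely.
srs = rng.sample(frame, 10)

# Stratified sample: a separate simple random sample within each stratum.
strata = {}
for stratum, student in frame:
    strata.setdefault(stratum, []).append((stratum, student))

allocation = {"first-year": 6, "senior": 4}   # assumed proportional allocation
stratified = [
    unit
    for name, units in strata.items()
    for unit in rng.sample(units, allocation[name])
]

print("SRS:       ", sorted(srs))
print("Stratified:", sorted(stratified))
```

The simple random sample can, by chance, contain any mix of the two groups. The stratified sample always contains exactly 6 first-years and 4 seniors.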

 

 

 

Bottom Line

It’s often difficult to obtain a sample that well-represents a population. But it’s important to work hard, thoroughly, and thoughtfully to get representative (if not random) data. In the end, you might have to work with non-ideal, yet reasonable data (but mention all the caveats—for example, the limitations in the scope of your inference).


 

Moving Toward Inference

A parameter is a number that describes the population (e.g., mean, median, proportion). In most (all?) real-world situations, the parameter is unknown (it’s very time-consuming and expensive to perform an accurate census of a population, and that process is often filled with errors). But what if we really, really want to know this parameter (e.g., it’s central to a research question)? A statistic is a number that describes a sample (e.g., sample mean, sample median, sample proportion). This statistic can be used to estimate the unknown parameter (yay!). But the statistic can change from sample to sample (clearly there is variability in the different samples). We need to understand how this statistic changes from sample to sample.
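
Here is a minimal sketch of the idea that a statistic changes from sample to sample. The population is invented (5,000 values that are roughly normal with mean 170), so in this toy setting we can compute the parameter directly, which we usually cannot do in practice.

```python
import random
from statistics import mean

rng = random.Random(4)

# Hypothetical population of 5,000 values; its mean is the parameter.
population = [rng.gauss(170, 8) for _ in range(5_000)]
parameter = mean(population)

# Five separate simple random samples of size 25, each with its own statistic.
sample_means = [mean(rng.sample(population, 25)) for _ in range(5)]

print(f"parameter (population mean): {parameter:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean (statistic):   {m:.2f}")
```

Each sample gives a different sample mean, and none of them needs to equal the population mean exactly.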

 

The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population. Think carefully about this definition—it contains much information (students often think they understand sampling distributions when, in fact, they do not; this is a slippery idea). It’s most important that you fully understand the idea of a sampling distribution. Note we do not actually do the repeated sampling—we rely on theory or simulation for this—but you must understand this concept before you can understand inference.
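
For a tiny population, the definition can be carried out literally by listing every possible sample. The sketch below does this for a made-up population of six values and samples of size 2 (drawn without replacement). This is only feasible because the population is so small; it is meant to illustrate the definition, not to suggest a practical method.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [2, 4, 4, 6, 8, 12]   # made-up population of six individuals
n = 2

# combinations() works by position, so the two 4s count as different individuals.
samples = list(combinations(population, n))

# The statistic (sample mean) computed for every possible sample.
sample_means = [Fraction(sum(s), n) for s in samples]

# The sampling distribution: each possible value of the statistic and how
# many of the equally likely samples produce it.
distribution = Counter(sample_means)

population_mean = Fraction(sum(population), len(population))
mean_of_means = sum(sample_means) / len(sample_means)

print("number of possible samples:", len(samples))
for value in sorted(distribution):
    print(f"sample mean {float(value):5.1f} occurs in {distribution[value]} samples")
print("population mean:", population_mean, " mean of sample means:", mean_of_means)
```

Notice that the list describes the behavior of the statistic across samples, not the values of any one sample.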

 

It seems natural to use the sample mean, $\bar{x}$, to estimate the population mean, $\mu$ (if the sample mean comes from a random, or very representative, sample). We must know the sampling distribution of $\bar{x}$ before we can make inference about the population mean. (We’ll discuss this in much more detail later in the course.)

 

Example 1

Consider a hypothetical (infinite) population of coin flips. If the coin is fair, we know the probability of heads: $p = 0.5$. (Typically, we don’t know the parameter, but for this example, which illustrates properties of estimators, we do.) Our statistic is the sample proportion of heads based on 50 flips of this coin. In this case, our sample size is 50 and our population is the infinite number of coin flips. Suppose we repeat the 50 flips (that is, repeat the sampling) many, many times, each time determining the sample proportion. The first graph below shows the distribution of the sample proportion of heads across many samples of size 50 (this is an estimate of the sampling distribution, not the full sampling distribution). Note this graph is NOT a distribution of a single sample. It is the distribution of proportions, each determined from a separate sample of size 50.

·         What properties do you notice about the sampling distribution of the sample proportion (based on 50 coin flips)?

·         The second graph compares two sampling distributions of the proportion: the first is based on samples of size 50 and the second is based on samples of size 500. What’s the most striking feature of this graph?
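
If you would like to generate data like this yourself, the sketch below simulates the repeated sampling for a fair coin with samples of size 50 and size 500. The number of repetitions (2,000), the seed, and the function name sample_proportions are arbitrary choices for illustration and are not necessarily how the graphs in this handout were produced.

```python
import random
from statistics import mean, stdev


def sample_proportions(n, repetitions=2_000, p=0.5, seed=5):
    """Simulate `repetitions` samples of n coin flips; return each sample's proportion of heads."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(repetitions)
    ]


props_50 = sample_proportions(50)
props_500 = sample_proportions(500)

for n, props in ((50, props_50), (500, props_500)):
    print(f"n = {n:3d}: mean of proportions {mean(props):.3f}, "
          f"standard deviation {stdev(props):.4f}")
```

Each list holds one proportion per simulated sample, so a histogram of props_50 or props_500 is an estimate of the corresponding sampling distribution. Compare the centers and the spreads of the two lists when you think about the questions above.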

 

Example 2

Recall the sampling activity we did in lab (three different methods used to estimate the total area of the 100 “houses”). Included below is a graph, based on data from previous classes, that compares the distributions of estimates. Which method produces unbiased estimates?