Math 207—Data Collection

 

Observation versus Experimentation:

 

·       An observational study observes individuals and measures variables of interest, but does not attempt to influence the responses. Because no treatment is imposed, confounding can occur.

 

·       An experiment deliberately imposes some treatment on individuals in order to observe their responses.

 

·       Unlike an observational study, a well-designed experiment can show a causal link between variables.
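To see why an experiment can support a causal claim when an observational study cannot, a small simulation may help. This is only a sketch under made-up assumptions: the confounder ("health"), the rule by which people opt in to the treatment, and the function name difference_in_means are all hypothetical. The treatment is built to have no effect at all, so any gap between the groups comes from confounding or from chance.

```python
import random
from statistics import mean

rng = random.Random(207)  # fixed seed so the run is repeatable

def difference_in_means(randomize, n=20000):
    """Mean response of treated minus untreated when the treatment has NO real effect."""
    treated, untreated = [], []
    for _ in range(n):
        health = rng.gauss(0.0, 1.0)  # hypothetical confounding variable
        if randomize:
            # Experiment: a coin flip decides who gets the treatment.
            takes_treatment = rng.random() < 0.5
        else:
            # Observational study: healthier people tend to choose the treatment.
            takes_treatment = health + rng.gauss(0.0, 1.0) > 0
        # The response depends on health only, never on the treatment.
        response = health + rng.gauss(0.0, 1.0)
        (treated if takes_treatment else untreated).append(response)
    return mean(treated) - mean(untreated)

observational_gap = difference_in_means(randomize=False)
experimental_gap = difference_in_means(randomize=True)
print("observational study:", round(observational_gap, 2))
print("randomized experiment:", round(experimental_gap, 2))
```

In the observational version the treated group looks better only because healthier people chose the treatment. With random assignment, health is balanced across the groups on average, so the gap should be close to zero.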

 

Principles of Good Experimental Design:

 

·       Control the effects of confounding variables on the response, most simply by comparing at least two treatments.

 

·       Randomly assign experimental units to treatments (to reduce bias).

 

·       Replicate each treatment on many units to reduce chance variation in the results.
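As a minimal sketch of random assignment, one can shuffle the experimental units and deal them out to the treatments in turn. The subject labels, the treatment names, and the function randomly_assign are hypothetical, chosen only for illustration.

```python
import random

def randomly_assign(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)   # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        # Dealing in turn keeps the group sizes equal (or within one of each other).
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

units = [f"subject_{i}" for i in range(1, 21)]  # 20 hypothetical subjects
groups = randomly_assign(units, ["treatment", "control"], seed=207)
for name, members in groups.items():
    print(name, len(members), members)
```

Comparing two groups provides the control, the shuffle provides the randomization, and putting ten units in each group rather than one provides the replication.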

 

Note: There are many, many different types of experimental designs and analyses of those designs.

 

 

Sometimes, rather than linking one variable to another (as in an experiment), we simply want to take some measurement on a population. If the population is large, we may need to make inferences about the population from a sample.

 

The population is the entire collection of individuals about which we want information.

The sample is the collection of individuals we actually measure.

 

Potential Problems with Sampling/Data Collection:

 

·       Voluntary response (a voluntary response sample is often biased because people with strong opinions, especially negative opinions, are more likely to respond) 

 

·       Undercoverage (when some groups in the population are left out of the process of choosing a sample)

 

·       Nonresponse (e.g., person chooses not to answer, person isn’t home)

 

·       Response error (e.g., person lies or remembers incorrectly)

 

·       Wording of questions/Interview process (e.g., leading questions, prompting by interviewer)

 

·       Processing error (e.g., data entry error, misrecording on the form)

 

Clarifying Notes:

 

·       The first two bullet points above indicate problems with the process of choosing a sample. The last four bullet points indicate problems with the process of data collection.

 

·       Students sometimes confuse the issues of undercoverage and nonresponse. Undercoverage occurs when some people are left out of the process of choosing a sample (e.g., people without phones are excluded from a phone survey). Nonresponse occurs when someone who was chosen for the sample cannot be contacted or refuses to respond. That is, undercoverage is a problem with the process of choosing a sample, and nonresponse is a problem with the actual process of data collection.

 

·       Not all problems with sampling will necessarily lead to bias. For example, if undercoverage occurs, but the people left out share nothing in common that affects their responses, there may not be a bias. You need to make the case that a sampling problem will lead to bias.
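The last note can be illustrated with a small simulation. This is a sketch with invented numbers: the two populations, the 80% phone ownership rate, and the function names are all hypothetical. In both populations people without phones are left out of the sampling process, but only in the second do they differ from everyone else on the response being measured.

```python
import random
from statistics import mean

rng = random.Random(207)  # fixed seed so the run is repeatable

def make_population(n, phone_rate, mean_with_phone, mean_without_phone):
    """Build a hypothetical population of (has_phone, response) pairs."""
    population = []
    for _ in range(n):
        has_phone = rng.random() < phone_rate
        center = mean_with_phone if has_phone else mean_without_phone
        population.append((has_phone, rng.gauss(center, 1.0)))
    return population

def undercoverage_bias(population, sample_size=500, repeats=200):
    """Average of (sample mean - population mean) when only phone owners can be chosen."""
    true_mean = mean(response for _, response in population)
    reachable = [response for has_phone, response in population if has_phone]
    errors = [mean(rng.sample(reachable, sample_size)) - true_mean
              for _ in range(repeats)]
    return mean(errors)

# Excluded group answers like everyone else: undercoverage, but little or no bias.
same = make_population(20000, 0.8, 5.0, 5.0)
# Excluded group answers differently: undercoverage produces bias.
different = make_population(20000, 0.8, 5.0, 3.0)

bias_same = undercoverage_bias(same)
bias_different = undercoverage_bias(different)
print("bias when excluded group is similar:  ", round(bias_same, 3))
print("bias when excluded group is different:", round(bias_different, 3))
```

The sampling problem is identical in both cases. What decides whether it causes bias is whether the people left out differ on the variable of interest, which is the case you have to make.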

 

A simple random sample of size n is a sample chosen such that every group of n individuals from the population has the same chance of being the selected sample.

 

·       This is the gold standard, but it is sometimes difficult to do in practice.

 

·       This solves the problem of undercoverage (provided the list the sample is drawn from covers the whole population), but doesn’t solve the other problems (e.g., even if you select a simple random sample, you’re not guaranteed to get information from everyone in the sample).

 

·       Some situations call for more complex sampling plans (e.g., stratified random sampling).
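As a rough sketch of how these two sampling plans might be carried out in software, Python's random.sample draws without replacement so that every group of n items is equally likely to be chosen. The roster of 100 students, the two strata, and the function names below are hypothetical. The stratified version shown here takes the same number from each stratum, which is only one of several possible ways to allocate the sample.

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, seed=None):
    """Draw n individuals so that every group of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def stratified_random_sample(population, stratum_of, n_per_stratum, seed=None):
    """Split the population into strata, then take a simple random sample in each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for individual in population:
        strata[stratum_of(individual)].append(individual)
    return {label: rng.sample(members, n_per_stratum)
            for label, members in strata.items()}

# Hypothetical roster: 60 first-year students and 40 seniors, stored as (id, year).
roster = [(i, "first-year" if i < 60 else "senior") for i in range(100)]

srs = simple_random_sample(roster, 10, seed=207)
stratified = stratified_random_sample(roster, lambda s: s[1], 5, seed=207)

print("simple random sample:", srs)
for year, members in stratified.items():
    print(year, members)
```

A simple random sample of 10 could, by chance, contain very few seniors. The stratified plan guarantees 5 from each year, at the cost of needing to know each individual's stratum before sampling.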