Required Course Reading:

Math 207 Solutions – One-variable Graphics and Numerical Summaries

 

  1.  
    1. Shown below are two possible stem-and-leaf plots. (Note that splitting the stems twice—leaves 0-4 on one line and leaves 5-9 on another—does not spread out the distribution enough to make an effective graph.)

 

Reported ACT Scores of Math 207 Students (n = 20)

Leaf Unit = 0.10

 

22 | 0

23 | 0

24 | 0

25 |

26 |

27 | 0 0

28 | 0 0

29 | 0 0 0

30 | 0 0 0 0

31 | 0 0 0 0

32 | 0 0

 

 

Reported ACT Scores of Math 207 Students (n = 20)

Leaf Unit = 1.0

 

2 | 2 3

2 | 4

2 | 7 7

2 | 8 8 9 9 9

3 | 0 0 0 0 1 1 1 1

3 | 2 2
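The second plot above can be reproduced with a short script. This is only a sketch: the `scores` list is read off the plots themselves, and the function name `stem_and_leaf` and its `lines_per_stem` parameter are illustrative choices, not part of the course materials.

```python
# ACT scores read off the stem-and-leaf plots above (n = 20).
scores = [22, 23, 24, 27, 27, 28, 28, 29, 29, 29,
          30, 30, 30, 30, 31, 31, 31, 31, 32, 32]


def stem_and_leaf(values, lines_per_stem=5):
    """Return the rows of a stem-and-leaf plot with leaf unit 1.

    Each stem (tens digit) is split into `lines_per_stem` rows, so with
    5 rows per stem the leaves are grouped as 0-1, 2-3, 4-5, 6-7, 8-9.
    `lines_per_stem` is assumed to divide 10 evenly (1, 2, 5, or 10).
    """
    width = 10 // lines_per_stem           # number of leaf digits per row
    rows = {}
    for v in sorted(values):               # sorting keeps the leaves ordered
        stem, leaf = divmod(v, 10)
        row_index = stem * lines_per_stem + leaf // width
        rows.setdefault(row_index, []).append(leaf)

    out = []
    # Walk every row from the lowest to the highest occupied one, so that
    # empty rows in between still appear and gaps stay visible.
    for row_index in range(min(rows), max(rows) + 1):
        stem = row_index // lines_per_stem
        leaves = " ".join(str(d) for d in rows.get(row_index, []))
        out.append(f"{stem} | {leaves}".rstrip())
    return out


for row in stem_and_leaf(scores):
    print(row)
```

Calling `stem_and_leaf(scores, lines_per_stem=2)` gives the two-lines-per-stem version mentioned in the note above, which packs the same 20 scores into only three rows.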

 

    1. The distribution of reported ACT scores is skewed to the left (i.e., to the low values). Most of the scores are centered around 30, but there is a longer left tail which extends down to 22.

 

    1. It is easy to use the stem-and-leaf plot to determine the five-number summary, since the individual observations are shown in the plot and they are ordered. The median is in the (20+1)(.5) = 10.5th position, so the median is (29 + 30)/2 = 29.5 (note the median is not an actual value in the data set; it is between two values). The first quartile is in the (20+1)(.25) = 5.25th position, which is one quarter of the way from the 5th ordered value (27) to the 6th ordered value (28), so Q1 = 27 + .25(28 – 27) = 27.25 (note we had to interpolate between two values). The third quartile is in the (20+1)(.75) = 15.75th position, and the 15th and 16th ordered values are both 31, so Q3 = 31 + .75(31 – 31) = 31. Hence, the five-number summary is (22, 27.25, 29.5, 31, 32).

 

Important note: If there were units attached to the variable values (say, points), then those units should be included in the five-number summary.
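The position rule used in the five-number summary above can be checked with a few lines of Python. This is a minimal sketch, assuming the (n+1)p position rule with linear interpolation described in the solution; the helper names `quantile_by_position` and `five_number_summary` are made up for illustration.

```python
# ACT scores read off the stem-and-leaf plots (n = 20).
scores = [22, 23, 24, 27, 27, 28, 28, 29, 29, 29,
          30, 30, 30, 30, 31, 31, 31, 31, 32, 32]


def quantile_by_position(sorted_values, p):
    """Value at 1-indexed position (n + 1) * p, interpolating linearly."""
    n = len(sorted_values)
    pos = (n + 1) * p
    lower = int(pos)               # whole part of the position
    frac = pos - lower             # fractional part used to interpolate
    if lower < 1:                  # position falls before the first value
        return sorted_values[0]
    if lower >= n:                 # position falls at or past the last value
        return sorted_values[-1]
    lo = sorted_values[lower - 1]  # value in position `lower`
    hi = sorted_values[lower]      # value in position `lower + 1`
    return lo + frac * (hi - lo)


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    s = sorted(values)
    return (s[0],
            quantile_by_position(s, 0.25),
            quantile_by_position(s, 0.50),
            quantile_by_position(s, 0.75),
            s[-1])


print(five_number_summary(scores))
```

Other software may use a different interpolation rule and report slightly different quartiles. Python's standard `statistics.quantiles(scores, n=4)` with its default `"exclusive"` method should agree with the (n+1)p rule used here.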

 

    1. The interquartile range (IQR) is 31 – 27.25 = 3.75. Then 1.5 times the IQR is 5.625. An observation is considered a suspected outlier if it’s smaller than 27.25 – 5.625 = 21.625 or if it’s larger than 31 + 5.625 = 36.625. None of the reported ACT scores is smaller than 21.625 or larger than 36.625. Hence, there are no suspected outliers.
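The 1.5 × IQR check can be sketched in Python as follows. The quartiles are taken from the five-number summary (22, 27.25, 29.5, 31, 32), and the variable names are illustrative.

```python
# ACT scores read off the stem-and-leaf plots (n = 20).
scores = [22, 23, 24, 27, 27, 28, 28, 29, 29, 29,
          30, 30, 30, 30, 31, 31, 31, 31, 32, 32]

q1, q3 = 27.25, 31.0             # quartiles from the five-number summary
iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # anything below this is a suspected outlier
upper_fence = q3 + 1.5 * iqr     # anything above this is a suspected outlier

suspected = [x for x in scores if x < lower_fence or x > upper_fence]

print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("suspected outliers:", suspected)
```

The smallest score, 22, sits just above the lower fence of 21.625, so the list of suspected outliers comes back empty.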

 

    1. If an outlier is detected, it should not necessarily be eliminated from the data set. If the data value was simply misrecorded, then that’s an easy change that should be made. If the data value isn’t a “true observation” (e.g., someone put their name on their ACT exam, but then got terribly ill and had to leave without answering any questions, so he/she got a 0; if a response came from an experiment where appropriate variables weren’t controlled), then it can be justifiably omitted from the data set. If the data value is a true observation and it’s an outlier, then it cannot justifiably be omitted and it should be included in the data analysis. One way of dealing with extreme outliers is to do appropriate analyses both with and without the outlier included and report both sets of results.
    2. A boxplot of the ACT scores is shown below. There are no suspected outliers in this distribution, but if there were they would be denoted by asterisks apart from the whiskers (and then the whiskers would extend to the largest—or smallest—value that isn’t an outlier). Note the boxplot also shows the left-skewed property of the distribution of scores.

 

 

 

  1.  
    1. The average age of all residents, 72 years, is a parameter, because it’s a number that describes a population. The average age of the 50 residents in the sample, 64 years, is a statistic, as it’s a number that describes a sample.

 

    1. The sample of 50 residents is certainly not a random sample from the population (anyone not at the health club was excluded from the sampling process, so not every resident had an equal chance of being selected). Most likely, the residents using the gym are in better health than the general population, and therefore may be younger than the general population. Hence, the sample average age is probably not a good estimate of the population average age in this case (the estimate is biased low, because of the flaw in the sampling process).

 

  1. For parts a through c, it’s important to draw a picture as part of the solution (these solutions don’t include pictures, simply because Word can’t draw them). We will study normal distributions in much more detail later in the term.

 

    1. The value of 395 is one standard deviation below the mean. Since 68% of the observations are within one standard deviation of the mean, 32% are outside this range. Then by symmetry of a mound-shaped (i.e., bell-shaped, normal) distribution, 32/2=16% of the observations are below 395.

 

    1. By symmetry and the Empirical Rule, 95/2 = 47.5% of the observations are between the values 505 and 725, and 68/2=34% of the observations are between 505 and 615. Then 47.5 – 34 = 13.5% of the observations are between 615 and 725.
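The Empirical Rule percentages in the two parts above are approximations, and they can be compared with exact normal-curve areas. The sketch below assumes the scores follow an exactly normal distribution with mean 505 and standard deviation 110; those two values are inferred from the solution (395 is one standard deviation below the mean, and 615 and 725 are one and two standard deviations above it).

```python
from statistics import NormalDist

# Assumed model: exactly normal, mean 505 and standard deviation 110.
dist = NormalDist(mu=505, sigma=110)

# Part a: proportion of observations below 395 (one SD below the mean).
exact_below_395 = dist.cdf(395)
rule_below_395 = (100 - 68) / 2          # Empirical Rule: 16%

# Part b: proportion between 615 and 725 (one to two SDs above the mean).
exact_615_to_725 = dist.cdf(725) - dist.cdf(615)
rule_615_to_725 = 95 / 2 - 68 / 2        # Empirical Rule: 13.5%

print(f"below 395:   rule {rule_below_395}%   exact {100 * exact_below_395:.2f}%")
print(f"615 to 725:  rule {rule_615_to_725}%  exact {100 * exact_615_to_725:.2f}%")
```

The exact areas are close to the rule-of-thumb values, which is why the Empirical Rule is good enough for these problems.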

 

    1. Bubba scored a 405, and his z-score is z = (405 – 505)/110 ≈ –0.91. In words, this means Bubba’s score is 0.91 standard deviations below the mean score. Hence, his score is not unusual (by the Empirical Rule, we know more than 32% of the other scores are more extreme than Bubba’s).

    2. It’s important to remember the Empirical Rule only applies to mound-shaped distributions. In this case, we don’t know the distribution is mound shaped, so we should use Tchebysheff’s rule to determine the percentage of observations within two standard deviations of the mean: at least 1 – 1/2^2 = 3/4, or 75%, of the scores are between 285 and 725.
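Both the z-score and Tchebysheff calculations above are one-line formulas. The sketch below assumes a mean of 505 and a standard deviation of 110, values inferred from the other parts of this problem; the function names are illustrative.

```python
mean, sd = 505, 110    # inferred from the solution, not stated directly


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def chebyshev_lower_bound(k):
    """Tchebysheff's rule: at least 1 - 1/k^2 of any data set lies within
    k standard deviations of the mean (informative only for k > 1)."""
    if k <= 1:
        return 0.0
    return 1 - 1 / k**2


z_bubba = z_score(405, mean, sd)
interval = (mean - 2 * sd, mean + 2 * sd)   # two SDs either side of the mean

print(f"Bubba's z-score: {z_bubba:.2f}")
print(f"At least {100 * chebyshev_lower_bound(2):.0f}% of scores lie in {interval}")
```

Unlike the Empirical Rule, the Tchebysheff bound holds for any distribution shape, which is why it gives only a minimum percentage.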

 

Note: Both the Empirical Rule and Tchebysheff’s Rule tell us it’s unusual (although certainly not impossible) to have an observed value more than 3 standard deviations from the mean (and the Empirical Rule tells us it’s even unusual to have an observation more than 2 standard deviations from the mean).

 

  1. The bar chart shows the raw counts of males and females who prefer each type of peanut butter. There were more females (18) than males (8) in Math 117, though, so the bars for females are almost always taller than the bars for males. This creates a potentially misleading graph. It would be better to have the bars show the percentage of males and females who prefer the different types of peanut butter, as shown below.
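Converting counts to within-group percentages is simple to sketch. The preference counts below are hypothetical, invented only so that they total 18 females and 8 males as in the problem; the category names are also assumptions, since the actual table is not reproduced in these solutions.

```python
# HYPOTHETICAL counts: only the group totals (18 and 8) come from the problem.
female_counts = {"creamy": 10, "crunchy": 6, "no preference": 2}
male_counts = {"creamy": 4, "crunchy": 3, "no preference": 1}


def to_percentages(counts):
    """Convert raw counts to percentages of the group total."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


female_pct = to_percentages(female_counts)
male_pct = to_percentages(male_counts)

for category in female_counts:
    print(f"{category:15s} females {female_pct[category]:5.1f}%   "
          f"males {male_pct[category]:5.1f}%")
```

With these made-up numbers, the raw count for "creamy" is more than twice as large for females (10 versus 4), yet the percentages are close (about 56% versus 50%), which is the kind of distortion a count-based bar chart can create.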

 

 

  1.  
    1. The distributions are both symmetric and they both balance at the same point. Since the mean is the balance point of the graphed distribution, the means for both classes must be the same.

 

    1. The standard deviation measures the spread of the distribution around the mean. The scores for Section A are more concentrated around the mean, while the scores for Section B are more spread out. Hence, the standard deviation of the scores for Section B must be larger.

 

    1. In this case, both sets of exam scores have the same range. So if we used the range as a measure of spread, we would think the two sets of exam scores had the same amount of spread. Visually (and as measured by the standard deviation), it seems obvious that the scores for Section B have more spread. Hence, the standard deviation is a better measure of spread.
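A small numerical example makes the comparison between range and standard deviation concrete. The two score lists below are hypothetical, chosen only so that both are symmetric with the same mean and the same range; they are not the data from the graphs.

```python
from statistics import mean, stdev

# HYPOTHETICAL exam scores, both symmetric around 70.
section_a = [50, 70, 70, 70, 70, 70, 90]   # concentrated near the mean
section_b = [50, 50, 60, 70, 80, 90, 90]   # spread out toward the extremes

for name, data in (("A", section_a), ("B", section_b)):
    data_range = max(data) - min(data)
    print(f"Section {name}: mean {mean(data)}, range {data_range}, "
          f"SD {stdev(data):.2f}")
```

Both sections have a mean of 70 and a range of 40, but the standard deviation of Section B is larger because more of its scores sit far from the mean.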

 

  1. This is a graph of the monetary amount (in dollars) of carried coin money. This variable has a natural boundary, as someone cannot carry a negative amount of coins. Furthermore, most people carry no coins or only a small number of coins, while a few people carry a lot of coins. So the tail of the distribution can extend out to the right, but it can’t extend to the left (because of the $0 boundary).

 

Why are the other answers incorrect?

 

·       The number of hours of sleep on a typical weeknight does have natural boundaries on both sides (0 on the low end and, practically speaking, say 12 on the high end), but it’s not clear that most of the observations would clump at one of those boundaries. It makes more sense for the distribution of this variable to be much less skewed than the graph shown.

 

·       If students truly choose a random integer, then the distribution of values should be uniform, rather than severely skewed. Humans are incapable, though, of choosing a random number, hence it’s very possible the distribution wouldn’t be uniform. That said, a huge majority of students probably wouldn’t choose a low number (e.g., 0) as their “random” number. Typically, the most common “random” integer chosen by students is 7.

 

·       Heights of people typically follow a normal distribution (there are natural boundaries for height, but the typical values of height fall far enough away from the boundaries that tails can form on both sides of the distribution). Since female and male heights were both recorded, it’s possible the graph of the height variable would be bi-modal. Neither of these shapes is displayed in the graph shown.

 

  1. There are many errors with Bubba’s stem-and-leaf plot:

 

·      The leaves are not ordered.

·      No leaf unit is given.

·      The “5” stem is not included.

·      The graph isn’t titled.

 

 

  1. Bubba should not create a histogram of home states, as this is a categorical variable. It only makes sense to create a histogram if the variable is quantitative (e.g., GPA and height). The home states could be shown in a bar chart, but not a histogram.

 

 

  1. Listed below are the reasons why each statement is correct or incorrect:

 

·       The median maximum speed for steel roller coasters is 50 mph. Because the distribution is obviously skewed right, the mean will actually be larger than 50 mph (but we can’t know the exact value from the boxplot).

 

·       Because the first quartile is closer to the median than the third quartile is (and there are the same number of observations in the range from the first quartile to the median as in the range from the median to the third quartile), we know the observations are more densely packed at the low end of the distribution and more spread out at the high end of the distribution. Hence, the distribution is skewed right. (We can also see this from the long right tail of the boxplot.)

 

·       This statement is correct, and can be seen by simply comparing the vertical lines in the boxes.

 

·       The boxplots give no information about how many of each type of coaster were sampled (they only show percentiles). Hence, we cannot tell whether there are more steel coasters.

 

·       The median maximum speed for the steel coasters is 50 mph, so 50% of these coasters have maximum speeds above 50 mph. The first quartile for the wooden coaster speeds is 50 mph, so 75% of these coasters have maximum speeds above 50 mph. Hence, a lower percentage of steel coasters have maximum speeds above 50 mph.

 

 

  1. The distribution of a random sample of incomes from the United States is most likely skewed right. There is a natural boundary of $0 as an income (people can’t make a negative amount of money). Also, many people make a small or moderate amount of money, yet a very few make an obscenely large amount of money, which creates a long right tail (because of the boundary at $0, there can’t be an equivalent left tail). Because the distribution is skewed right, the mean income will be greater than the median income (the mean is pulled up by the long right tail). Furthermore, the five-number summary is a better numerical summary for skewed distributions. (The mean and standard deviation should be used for symmetric distributions.)
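The pull of a long right tail on the mean is easy to see with a few numbers. The incomes below are hypothetical values made up for illustration, not real survey data.

```python
from statistics import mean, median

# HYPOTHETICAL annual incomes in dollars: most are modest, one is very large.
incomes = [18_000, 25_000, 32_000, 40_000, 47_000,
           55_000, 68_000, 90_000, 150_000, 1_200_000]

print("mean:  ", mean(incomes))
print("median:", median(incomes))

# Dropping the single largest income changes the mean a lot, the median little.
without_top = sorted(incomes)[:-1]
print("mean without top income:  ", mean(without_top))
print("median without top income:", median(without_top))
```

In this made-up sample the mean ($172,500) is more than three times the median ($51,000), and removing the one very large income cuts the mean by about two thirds while the median moves only from $51,000 to $47,000.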