Elementary Statistics Solutions – One-variable Graphics and Numerical Summaries

 

  1.  
    1. Shown below are two possible stem-and-leaf plots. (Note that splitting the stems twice—leaves 0-4 on one line and leaves 5-9 on another—does not spread out the distribution enough to make a meaningful graph.)

 

Reported ACT Scores of Math 117 Students (n = 20)

Leaf Unit = 0.10

 

25 | 0 0

26 | 0 0

27 | 0

28 | 0 0 0 0 0 0

29 | 0 0 0 0

30 | 0 0

31 | 0 0

32 | 0

 

 

Reported ACT Scores of Math 117 Students (n = 20)

Leaf Unit = 1.0

 

2 | 5 5

2 | 6 6 7

2 | 8 8 8 8 8 8 9 9 9 9

3 | 0 0 1 1

3 | 2
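As a rough sketch, the second plot can be reproduced programmatically. The scores below are read off the plots above; the function name `stem_and_leaf` and its `lines_per_stem` parameter are illustrative assumptions, and `lines_per_stem` is assumed to divide 10 evenly. Rows with no leaves are simply omitted here.

```python
from collections import defaultdict

# ACT scores read off the stem-and-leaf plots above (n = 20).
scores = [25, 25, 26, 26, 27, 28, 28, 28, 28, 28, 28,
          29, 29, 29, 29, 30, 30, 31, 31, 32]

def stem_and_leaf(values, lines_per_stem=5):
    """Return the rows of a stem-and-leaf plot with leaf unit 1.

    Each stem (tens digit) is split into `lines_per_stem` rows; with 5 rows
    per stem, leaves 0-1, 2-3, 4-5, 6-7 and 8-9 each share a row.
    """
    width = 10 // lines_per_stem              # number of leaf digits per row
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[(stem, leaf // width)].append(leaf)
    return [f"{stem} | {' '.join(map(str, leaves))}"
            for (stem, _), leaves in sorted(rows.items())]

plot_rows = stem_and_leaf(scores)
print("\n".join(plot_rows))
```

Calling `stem_and_leaf(scores, lines_per_stem=2)` gives the two-lines-per-stem version that the note above describes as too compressed.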

 

 

    1. The distribution of reported ACT scores is approximately symmetric around a score of 28. Most of the scores are concentrated in the range 28 – 29, with lower and higher “tails” extending to 25 and 32, respectively.

 

    1. It is easy to use the stem-and-leaf plot to determine the five-number summary, since the individual observations are shown in the plot and they are ordered. The median is in the (20+1)/2 = 10.5th position, halfway between the 10th and 11th ordered observations, so the median is (28 + 28)/2 = 28 (note that the median position falls between two observations rather than on a single observation). The lower half of the data set therefore contains 10 values and the upper half contains 10 values. The median of a set of 10 ordered observations is in the (10+1)/2 = 5.5th position. So the first quartile is (27 + 28)/2 = 27.5 and the third quartile is (29 + 30)/2 = 29.5. Hence, the five-number summary is (25, 27.5, 28, 29.5, 32).

 

Other important notes: If there were units attached to the variable values (say, points), then those units should be included in the five-number summary. Furthermore, if the median were an actual observation in the data set (which happens when the number of observations is odd), it would be excluded from both halves when determining the first and third quartiles.
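The median-of-halves method described above can be sketched in a few lines. The function name `five_number_summary` is an illustrative assumption, and the scores are read off the stem-and-leaf plot. Software packages often use other quartile rules, so their results may differ slightly.

```python
from statistics import median

# ACT scores read off the stem-and-leaf plot (n = 20).
scores = [25, 25, 26, 26, 27, 28, 28, 28, 28, 28, 28,
          29, 29, 29, 29, 30, 30, 31, 31, 32]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    When the number of observations is odd, the median observation is
    excluded from both halves before the quartiles are found.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])

summary = five_number_summary(scores)
print(summary)  # (25, 27.5, 28.0, 29.5, 32)
```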

 

    1. The interquartile range (IQR) is 29.5 – 27.5 = 2. Then 1.5 times the IQR is 3. An observation is considered a suspected outlier if it’s smaller than 27.5 – 3 = 24.5 or if it’s larger than 29.5 + 3 = 32.5. None of the reported ACT scores is smaller than 24.5 or larger than 32.5. Hence, there are no suspected outliers.
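The 1.5 × IQR check above can be written as a short sketch. The helper name `outlier_fences` is an assumption made for illustration; the quartiles are the ones computed by hand in the previous part.

```python
# ACT scores read off the stem-and-leaf plot (n = 20).
scores = [25, 25, 26, 26, 27, 28, 28, 28, 28, 28, 28,
          29, 29, 29, 29, 30, 30, 31, 31, 32]

def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs for suspected outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = outlier_fences(27.5, 29.5)
suspected = [x for x in scores if x < low or x > high]
print(low, high, suspected)  # 24.5 32.5 []
```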

 

    1. If an outlier is detected, it should not necessarily be eliminated from the data set. If the data value was simply misrecorded, then that’s an easy change that should be made. If the data value isn’t a “true observation” (e.g., someone put their name on their ACT exam, but then got terribly ill and had to leave without answering any questions, so he/she got a 0; or if a response came from an experiment where appropriate variables weren’t controlled), then it can justifiably be omitted from the data set. If the data value is a true observation and it’s an outlier, then it can’t justifiably be omitted and it should be included in the data analysis. One way of dealing with extreme outliers is to do appropriate analyses both with and without the outlier included and report both sets of results.

 

    1. A boxplot of the ACT scores is shown below. There are no suspected outliers, but if there were they would be denoted with asterisks apart from the whiskers, and then the whiskers would extend to the largest—or smallest—value that isn’t an outlier. (This graph was created by Minitab, a statistical software package, and Minitab uses a slightly different method to calculate the quartiles. Hence, the quartiles in this boxplot are slightly different from the ones calculated above.)

 

 

 

 

  1. The bar chart shows the raw counts of males and females who prefer each type of peanut butter. There were more females (18) than males (8) in Math 117, though, so the bars for females are almost always taller than the bars for males. This creates a potentially misleading graph. It would be better to have the bars show the percentage of males and females who prefer the different types of peanut butter, as shown below.
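To illustrate why percentages are the fairer comparison, here is a minimal sketch. Only the group totals (18 females, 8 males) come from the problem; the category names and the individual counts are hypothetical, invented purely for illustration.

```python
# Hypothetical counts: only the group totals (18 and 8) come from the problem.
counts = {
    "female": {"creamy": 10, "crunchy": 6, "no preference": 2},
    "male": {"creamy": 3, "crunchy": 4, "no preference": 1},
}

def within_group_percentages(group_counts):
    """Convert raw counts for one group into percentages of that group."""
    total = sum(group_counts.values())
    return {category: 100 * n / total for category, n in group_counts.items()}

percentages = {group: within_group_percentages(c) for group, c in counts.items()}
for group, pct in percentages.items():
    print(group, {category: round(p, 1) for category, p in pct.items()})
```

With these made-up counts, more females than males choose crunchy (6 versus 4), yet a larger share of males does (50% versus about 33%), which is exactly the distortion the raw-count chart hides.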

 

 

 

  1.  
    1. The distributions are both symmetric and they both balance at the same point. Since the mean is the balance point of the graphed distribution, the means for both classes must be the same.

 

    1. The standard deviation measures the spread of the distribution around the mean. The scores for Section A are more concentrated around the mean, while the scores for Section B are more spread out. Hence, the standard deviation of the scores for Section B must be larger.
    2. In this case, both sets of exam scores have the same range. So if we used the range as a measure of spread, we would think the two sets of exam scores had the same amount of spread. Visually (and as measured by the standard deviation), it seems obvious that the scores for Section B have more spread. Hence, the standard deviation is a better measure of spread.
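A small numerical sketch makes the same point. The two score lists below are hypothetical (they are not the data behind the graphs in the problem); they were chosen to be symmetric about the same mean and to share the same range.

```python
from statistics import mean, stdev

# Hypothetical exam scores, invented for illustration only.
section_a = [50, 70, 70, 70, 70, 70, 90]   # concentrated near the mean
section_b = [50, 55, 60, 70, 80, 85, 90]   # spread out across the range

for name, data in (("A", section_a), ("B", section_b)):
    print(name,
          "mean:", mean(data),
          "range:", max(data) - min(data),
          "SD:", round(stdev(data), 2))
```

Both sections have the same mean and the same range, but the standard deviation for Section B comes out larger, matching the visual impression.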

 

 

  1. This is a graph of the monetary amount (in dollars) of carried coin money. This variable has a natural boundary, as someone cannot carry a negative amount of coins. Furthermore, most people carry no coins or only a small number of coins, while a few people carry a lot of coins. So the tail of the distribution can extend out to the right, but it can’t extend to the left (because of the $0 boundary).

 

Why are the other answers incorrect?

 

·       The number of hours of sleep on a typical weeknight does have natural boundaries on both sides (0 on the low end and, practically speaking, about 12 on the high end), but it’s not clear that most of the observations would clump at one of those boundaries. It makes more sense for the distribution of this variable to be much less skewed than the graph shown.

 

·       If students truly chose a random integer, then the distribution of values should be uniform, rather than severely skewed. Humans are poor at choosing a random number, though, so it’s very possible that the distribution wouldn’t be uniform. That said, a huge majority of students probably wouldn’t choose a low number (e.g., 0) as their “random” number. Typically, the most common “random” integer chosen by students is 7.

 

·       Heights of people typically follow a normal distribution (there are natural boundaries for height, but the typical values of height fall far enough away from the boundaries that tails can form on both sides of the distribution). Since female and male heights were both recorded in Math 117, it’s possible the graph of the height variable would be bi-modal. Neither of these shapes is displayed in the graph shown.

 

 

  1.  
    1. When answering this question, the different sample sizes must be taken into consideration. That is, because the class sizes are different, we can’t simply average the two averages. To find the overall mean, we must find the total score for both classes combined and then divide by 50:

 

Note this is simply a weighted average.
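In general form, the weighted average can be written as below. The symbols are assumptions introduced here, not notation from the original problem: n_A and n_B are the two class sizes (with n_A + n_B = 50), and x̄_A and x̄_B are the two class means.

```latex
\bar{x}_{\text{overall}}
  = \frac{n_A\,\bar{x}_A + n_B\,\bar{x}_B}{n_A + n_B}
  = \frac{n_A}{50}\,\bar{x}_A + \frac{n_B}{50}\,\bar{x}_B,
  \qquad n_A + n_B = 50
```

Each class mean is weighted by its share of the 50 students, which is why the larger class pulls the overall mean toward its own mean.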

 

    1. The appropriate linear transformation is . The mean is affected by both the additive constant and the multiplicative constant, so the new mean is . The standard deviation is only affected by the multiplicative constant, so the new standard deviation is .

 

    1. Consider Class A first. The standard deviation is only affected by the multiplier, b, so the change needed in the standard deviation defines b: . The mean is affected by both a and b, so we can determine the value of a based on the change we want in the mean: . Hence, for Class A the appropriate linear transformation is .

 

For Class B, we first find the value of b: . Then we find the value of a: . Hence, for Class B the appropriate linear transformation is .
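The two-step reasoning used for both classes (find b from the standard deviations, then a from the means) can be sketched as follows. The function name, the raw scores, and the target values are all hypothetical, and the sketch assumes a positive multiplier b, so that the standard deviation is scaled by b itself.

```python
from math import isclose
from statistics import mean, stdev

def linear_transformation(old_mean, old_sd, new_mean, new_sd):
    """Return (a, b) such that x_new = a + b * x has the requested mean and SD."""
    b = new_sd / old_sd            # the SD is affected only by the multiplier
    a = new_mean - b * old_mean    # the mean is affected by both a and b
    return a, b

# Hypothetical raw scores and targets, for illustration only.
raw = [60, 65, 70, 75, 80]
a, b = linear_transformation(mean(raw), stdev(raw), new_mean=80, new_sd=10)
transformed = [a + b * x for x in raw]

print(round(a, 2), round(b, 2))
print(round(mean(transformed), 2), round(stdev(transformed), 2))
```

Recomputing the mean and standard deviation of the transformed scores is a quick way to confirm that the chosen a and b hit both targets.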

 

 

  1. There are many errors with Bubba’s stem-and-leaf plot:

 

·      The leaves are not ordered.

·      No leaf unit is given.

·      The “5” stem is not included.

·      The graph isn’t titled.

 

 

  1. Bubba should not create a histogram of home states, as this is a categorical variable. It only makes sense to create a histogram if the variable is quantitative (e.g., GPA and height). The home states could be shown in a bar chart, but not a histogram.

 

 

  1. Listed below are the reasons why each statement is correct or incorrect:

 

·       The median maximum speed for steel roller coasters is 50 mph. Because the distribution is obviously skewed right, the mean will actually be larger than 50 mph (but we can’t know the exact value from the boxplot).

 

·       Because the first quartile is closer to the median than the third quartile is (and there are the same number of observations in the range from the first quartile to the median as in the range from the median to the third quartile), we know the observations are more densely packed at the low end of the distribution and more spread out at the high end of the distribution. Hence, the distribution is skewed right. (We can also see this from the long right tail of the boxplot.)

 

·       This statement is correct, and can be seen by simply comparing the vertical lines in the boxes.

 

·       The boxplots give no information about how many of each type of coaster were sampled. Hence, we cannot tell whether there are more steel coasters.

 

·       The median maximum speed for the steel coasters is 50 mph, so 50% of these coasters have maximum speeds above 50 mph. The first quartile for the wooden coaster speeds is 50 mph, so 75% of these coasters have maximum speeds above 50 mph. Hence, a lower percentage of steel coasters have maximum speeds above 50 mph.

 

 

  1. The distribution of a random sample of incomes from the United States is most likely skewed right. There is a natural boundary of $0 as an income (people can’t make a negative amount of money). Also, many people make a small or moderate amount of money, yet a very few make an extremely large amount of money, which creates a long right tail (because of the boundary at $0, there can’t be an equivalent left tail). Because the distribution is skewed right, the mean income will be greater than the median income (the mean is pulled up by the long right tail). Furthermore, the five-number summary is a better numerical summary for skewed distributions. (The mean and standard deviation should be used for symmetric distributions.)