Elementary Statistics Solutions – One-variable Graphics and Numerical Summaries
-
- Shown below are two possible stem-and-leaf plots. (Note that splitting
each stem into only two lines, with leaves 0-4 on one line and leaves 5-9
on another, does not spread out the distribution enough to make a
meaningful graph.)
Reported ACT Scores of Math 117 Students (n = 20)
Leaf Unit = 0.10
25 | 0 0
26 | 0 0
27 | 0
28 | 0 0 0 0 0 0
29 | 0 0 0 0
30 | 0 0
31 | 0 0
32 | 0
Reported ACT Scores of Math 117 Students (n = 20)
Leaf Unit = 1.0
2 | 5 5
2 | 6 6 7
2 | 8 8 8 8 8 8 9 9 9 9
3 | 0 0 1 1
3 | 2
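The second plot splits each stem into five lines (leaves 0-1, 2-3, 4-5, 6-7, 8-9). As a rough sketch of how such a plot could be built, the Python below groups the 20 scores read off the plot by stem and by leaf pair. The function name `stem_and_leaf` and its `lines_per_stem` parameter are illustrative choices, not part of the original solution, and the code assumes two-digit whole numbers with a leaf unit of 1.

```python
def stem_and_leaf(values, lines_per_stem=5):
    """Return stem-and-leaf rows for two-digit whole numbers (leaf unit 1).

    lines_per_stem must divide 10 evenly (1, 2, or 5).
    """
    width = 10 // lines_per_stem      # how many leaf digits share one line
    rows = {}
    for x in sorted(values):
        stem, leaf = divmod(x, 10)
        rows.setdefault((stem, leaf // width), []).append(leaf)

    first, last = min(rows), max(rows)
    lines = []
    # Walk every line between the first and last occupied ones so that
    # empty lines inside the distribution are still shown.
    for stem in range(first[0], last[0] + 1):
        for part in range(lines_per_stem):
            if first <= (stem, part) <= last:
                leaves = " ".join(str(leaf) for leaf in rows.get((stem, part), []))
                lines.append(f"{stem} | {leaves}".rstrip())
    return lines


# Reported ACT scores read off the stem-and-leaf plot (n = 20)
act_scores = [25, 25, 26, 26, 27, 28, 28, 28, 28, 28,
              28, 29, 29, 29, 29, 30, 30, 31, 31, 32]

print("\n".join(stem_and_leaf(act_scores)))
```

Calling `stem_and_leaf(act_scores, lines_per_stem=2)` collapses the data onto just two lines, which illustrates why that split does not spread out the distribution enough.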
- The
distribution of reported ACT scores is approximately symmetric around a
score of 28. Most of the scores are concentrated in the range 28 – 29,
with lower and higher “tails” extending to 25 and 32, respectively.
- It is easy to use the stem-and-leaf plot to determine the five-number
summary, since the individual observations are shown in the plot and they
are ordered. The median is in the (20+1)/2 = 10.5th position, so it is the
average of the 10th and 11th ordered observations: (28 + 28)/2 = 28 (note
that the median does not correspond to a single observation; it falls
between two of them). This leaves a lower half of the data set that
contains 10 values and an upper half that contains 10 values. The median
of a set of 10 ordered observations is in the (10+1)/2 = 5.5th position.
So the first quartile is (27 + 28)/2 = 27.5 and the third quartile is
(29 + 30)/2 = 29.5. Hence, the five-number summary is (25, 27.5, 28,
29.5, 32).
Other important notes: If there were units attached to the variable
values (say, points), then those units should be included in the
five-number summary. Furthermore, if the median were a single observation
in the data set (as happens when the number of observations is odd), that
observation would be excluded when determining the first and third
quartiles.
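The median-of-halves calculation described above can be checked with a short Python sketch. The helper name `five_number_summary` is an illustrative choice; it follows the convention stated here, dropping the middle observation from both halves when the number of observations is odd. Other software may use a different quartile rule.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # observations below the median position
    upper = data[half + n % 2:]    # skips the middle observation when n is odd
    return (data[0], median(lower), median(data), median(upper), data[-1])


# Reported ACT scores read off the stem-and-leaf plot (n = 20)
act_scores = [25, 25, 26, 26, 27, 28, 28, 28, 28, 28,
              28, 29, 29, 29, 29, 30, 30, 31, 31, 32]

print(five_number_summary(act_scores))  # (25, 27.5, 28.0, 29.5, 32)
```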
- The
interquartile range (IQR) is 29.5 – 27.5 = 2. Then 1.5 times the IQR is 3.
An observation is considered a suspected outlier if it’s smaller than 27.5
– 3 = 24.5 or if it’s larger than 29.5 + 3 = 32.5. None of the reported
ACT scores is smaller than 24.5 or larger than 32.5. Hence, there are no
suspected outliers.
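The 1.5 × IQR rule used above can be sketched in a few lines of Python. The function names `outlier_fences` and `suspected_outliers` are illustrative, and the quartiles are passed in explicitly so that the values computed by hand (27.5 and 29.5) are used as-is.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs of the 1.5 x IQR rule."""
    iqr = q3 - q1
    return (q1 - 1.5 * iqr, q3 + 1.5 * iqr)


def suspected_outliers(values, q1, q3):
    """Return the observations that fall outside the fences."""
    low, high = outlier_fences(q1, q3)
    return [x for x in values if x < low or x > high]


# Reported ACT scores read off the stem-and-leaf plot (n = 20)
act_scores = [25, 25, 26, 26, 27, 28, 28, 28, 28, 28,
              28, 29, 29, 29, 29, 30, 30, 31, 31, 32]

print(outlier_fences(27.5, 29.5))                  # (24.5, 32.5)
print(suspected_outliers(act_scores, 27.5, 29.5))  # []
```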
- If an outlier is detected, it should not necessarily be eliminated from
the data set. If the data value was simply misrecorded, then it should be
corrected. If the data value isn’t a “true observation” (e.g., someone
put their name on their ACT exam but then became terribly ill and had to
leave without answering any questions, so they received a 0; or a
response came from an experiment where appropriate variables weren’t
controlled), then it can justifiably be omitted from the data set. If the
data value is a true observation and it’s an outlier, then it can’t
justifiably be omitted and it should be included in the data analysis.
One way of dealing with extreme outliers is to do the appropriate
analyses both with and without the outlier and report both sets of
results.
- A boxplot of the ACT scores is shown below. There are no suspected
outliers, but if there were, they would be denoted with asterisks beyond
the whiskers, and the whiskers would extend only to the largest (or
smallest) value that isn’t an outlier. (This graph was created by Minitab,
a statistical software package, and Minitab uses a slightly different
method to calculate the quartiles. Hence, the quartiles in this boxplot
are slightly different from the ones calculated above.)

- The
bar chart shows the raw counts of males and females who prefer each type
of peanut butter. There were more females (18) than males (8) in Math 117,
though, so the bars for females are almost always taller than the bars for
males. This creates a potentially misleading graph. It would be better to
have the bars show the percentage of males and females who prefer the different
types of peanut butter, as shown below.

-
- The
distributions are both symmetric and they both balance at the same point.
Since the mean is the balance point of the graphed distribution, the
means for both classes must be the same.
- The
standard deviation measures the spread of the distribution around the
mean. The scores for Section A are more concentrated around the mean,
while the scores for Section B are more spread out. Hence, the standard
deviation of the scores for Section B must be larger.
- In
this case, both sets of exam scores have the same range. So if we used
the range as a measure of spread, we would think the two sets of exam
scores had the same amount of spread. Visually (and as measured by the
standard deviation), it seems obvious that the scores for Section B have
more spread. Hence, the standard deviation is a better measure of spread.
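To make the comparison concrete, here is a small Python sketch using two made-up sets of scores. The numbers are hypothetical (the actual Section A and Section B scores are not given here) and were chosen only so that both sets are symmetric with the same mean and the same range.

```python
from statistics import mean, stdev

# Hypothetical exam scores, invented for illustration only
section_a = [50, 70, 75, 75, 75, 80, 100]   # concentrated near the mean
section_b = [50, 55, 60, 75, 90, 95, 100]   # more spread out


def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


for name, scores in (("Section A", section_a), ("Section B", section_b)):
    print(name, mean(scores), value_range(scores), round(stdev(scores), 2))
```

Both hypothetical sections have the same mean and the same range, yet the standard deviation is larger for Section B. That difference is what the range fails to capture.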
- This is a graph of the amount of money (in dollars) carried in coins.
This variable has a natural boundary, as someone cannot carry a negative
amount of coins. Furthermore, most people carry no coins or only a small
number of coins, while a few people carry a lot of coins. So the tail of
the distribution can extend out to the right, but it can’t extend to the
left (because of the $0 boundary).
Why are the other answers incorrect?
· The number of hours of sleep on a typical weeknight does have natural
boundaries on both sides (0 on the low end and, practically speaking, say
12 on the high end), but it’s not clear that most of the observations
would clump at one of those boundaries. It makes more sense for the
distribution of this variable to be much less skewed than the graph shown.
· If students truly chose a random integer, then the distribution of
values would be uniform rather than severely skewed. Humans are incapable,
though, of choosing a truly random number, so it’s very possible that the
distribution wouldn’t be uniform. That said, a huge majority of students
probably wouldn’t choose a low number (e.g., 0) as their “random” number.
Typically, the most common “random” integer chosen by students is 7.
· Heights of people typically follow a normal distribution (there are
natural boundaries for height, but the typical values of height fall far
enough away from the boundaries that tails can form on both sides of the
distribution). Since female and male heights were both recorded in Math
117, it’s possible the graph of the height variable would be bimodal.
Neither of these shapes is displayed in the graph shown.
-
- When
answering this question, the different sample sizes must be taken into
consideration. That is, because the class sizes are different, we can’t
simply average the two averages. To find the overall mean, we must find
the total score for both classes combined and then divide by 50:

Note this is simply a weighted
average.
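A minimal Python sketch of the weighted average follows. The class sizes (20 and 30) and class means (80 and 90) are hypothetical placeholders chosen only so that the sizes total 50; they are not the values from the original problem.

```python
def combined_mean(means, sizes):
    """Overall mean = (total of all scores) / (total number of scores)."""
    total = sum(m * n for m, n in zip(means, sizes))
    return total / sum(sizes)


# Hypothetical class means and class sizes (the sizes add up to 50)
class_means = [80, 90]
class_sizes = [20, 30]

print(combined_mean(class_means, class_sizes))  # 86.0
print(sum(class_means) / len(class_means))      # 85.0, the naive average
```

The larger class pulls the overall mean toward its own mean, which is why simply averaging the two averages gives a different answer.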
- The
appropriate linear transformation is
. The mean is affected by both the additive constant
and the multiplicative constant, so the new mean is
. The standard deviation is only affected by the
multiplicative constant, so the new standard deviation is
.
- Consider
Class A first. The standard deviation is only affected by the multiplier,
b, so the change needed in the
standard deviation defines b:
. The mean is affected by both a and b, so we can
determine the value of a based
on the change we want in the mean:
. Hence, for Class A the appropriate linear
transformation is
.
For Class B, we first find the
value of b:
. Then we find the value of a:
. Hence, for Class B the appropriate linear transformation is
.
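The two-step reasoning above (find b from the standard deviations, then find a from the means) can be sketched in Python. The sketch assumes the transformation is written as new = a + b × old with b > 0, and the numbers in the example are hypothetical, not the values for Class A or Class B.

```python
from math import isclose
from statistics import mean, stdev


def linear_transformation(old_mean, old_sd, new_mean, new_sd):
    """Return (a, b) so that new = a + b * old has the desired mean and SD."""
    b = new_sd / old_sd             # the SD is affected only by the multiplier
    a = new_mean - b * old_mean     # the mean is affected by both a and b
    return a, b


# Hypothetical scores with mean 60 and standard deviation 5
scores = [55, 60, 65]
a, b = linear_transformation(mean(scores), stdev(scores), new_mean=75, new_sd=10)
transformed = [a + b * x for x in scores]

print(a, b)                                   # -45.0 2.0
print(mean(transformed), stdev(transformed))  # 75.0 10.0
```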
- There are many errors in Bubba’s stem-and-leaf plot:
· The leaves are not ordered.
· No leaf unit is given.
· The “5” stem is not included.
· The graph isn’t titled.
- Bubba should not create a histogram of home states, as this is a
categorical variable. It only makes sense to create a histogram if the
variable is quantitative (e.g., GPA or height). The home states could be
shown in a bar chart, but not a histogram.
- Listed below are the reasons why each statement is correct or incorrect:
· The median maximum speed for steel roller coasters is 50 mph. Because the
distribution is obviously skewed right, the mean will actually be larger
than 50 mph (but we can’t know the exact value from the boxplot).
· Because the first quartile is closer to the median than the third
quartile is (and there are the same number of observations in the range
from the first quartile to the median as in the range from the median to
the third quartile), we know the observations are more densely packed at
the low end of the distribution and more spread out at the high end.
Hence, the distribution is skewed right. (We can also see this from the
long right tail of the boxplot.)
· This statement is correct, and can be seen by simply comparing the
vertical lines in the boxes.
· The boxplots give no information about how many of each type of coaster
were sampled. Hence, we cannot tell whether there are more steel coasters.
· The median maximum speed for the steel coasters is 50 mph, so 50% of
these coasters have maximum speeds above 50 mph. The first quartile for
the wooden coaster speeds is 50 mph, so 75% of these coasters have maximum
speeds above 50 mph. Hence, a lower percentage of steel coasters have
maximum speeds above 50 mph.
- The distribution of a random sample of incomes from the United States
is most likely skewed right. There is a natural boundary of $0 as an
income (people can’t make a negative amount of money). Also, many people
make a small or moderate amount of money, while a very few make an
obscenely large amount of money, which creates a long right tail (because
of the boundary at $0, there can’t be an equivalent left tail). Because
the distribution is skewed right, the mean income will be greater than the
median income (the mean is pulled up by the long right tail). Furthermore,
the five-number summary is a better numerical summary for skewed
distributions. (The mean and standard deviation should be used for
symmetric distributions.)
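As a quick illustration of how a long right tail pulls the mean above the median, consider the Python sketch below. The incomes are invented for illustration and are not real data.

```python
from statistics import mean, median

# Hypothetical annual incomes in dollars: six moderate values and one
# very large value that forms the long right tail
incomes = [20_000, 25_000, 30_000, 35_000, 40_000, 45_000, 1_000_000]

print(median(incomes))  # 35000
print(mean(incomes) > median(incomes))  # True
```

The single large income raises the mean substantially but leaves the median unchanged, which is why the median (and the five-number summary more generally) describes a skewed distribution better.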