Required Course Reading:
Math 207 Solutions – One-variable Graphics and Numerical Summaries
-
- Shown
below are two possible stem-and-leaf plots. (Note that splitting the
stems twice—leaves 0-4 on one line and leaves 5-9 on another—does not spread out the distribution enough to make an
effective graph.)
Reported ACT Scores of Math 207 Students (n = 20)
Leaf Unit = 0.10
22 | 0
23 | 0
24 | 0
25 |
26 |
27 | 0 0
28 | 0 0
29 | 0 0 0
30 | 0 0 0 0
31 | 0 0 0 0
32 | 0 0
Reported ACT Scores of Math 207 Students (n = 20)
Leaf Unit = 1.0
2 | 2 3
2 | 4
2 | 7 7
2 | 8 8 9 9 9
3 | 0 0 0 0 1 1 1 1
3 | 2 2
- The
distribution of reported ACT scores is skewed to the left (i.e., to the low values). Most of
the scores are centered around 30, but there is
a longer left tail which extends down to 22.
- It
is easy to use the stem-and-leaf plot to determine the five-number
summary, since the individual observations are shown in the plot and they
are ordered. The median is in the (20+1)(.5) = 10.5th
position, so the median is (29 + 30)/2 = 29.5 (note the median is not an
actual value in the data set—it is between two values). The first
quartile is in the (20+1)(.25) = 5.25th
position, so Q1 = 27 + 0.25(28 – 27) = 27.25 (note we had to
interpolate between two values). The third quartile is in the (20+1)(.75) = 15.75th position, so
Q3 = 31 + 0.75(31 – 31) = 31. Hence, the five-number summary is (22, 27.25, 29.5, 31,
32).
Important note: If there were units attached to the variable values
(say, points), then those units should be included in the five-number summary.
- The interquartile range (IQR) is 31 – 27.25 = 3.75. Then
1.5 times the IQR is 5.625. An observation is considered a suspected
outlier if it’s smaller than 27.25 – 5.625 = 21.625 or if it’s larger
than 31 + 5.625 = 36.625. None of the reported ACT scores is smaller than
21.625 or larger than 36.625. Hence, there are no suspected outliers.
- If
an outlier is detected, it should not necessarily be eliminated from the
data set. If the data value was simply misrecorded,
then that’s an easy change that should be made. If the data value isn’t a
“true observation” (e.g.,
someone put their name on their ACT exam but then got terribly ill and
had to leave without answering any questions, so they got a 0; or a
response came from an experiment where appropriate variables weren’t
controlled), then it can be justifiably omitted from the data set. If the
data value is a true observation and it’s an outlier, then it cannot
justifiably be omitted and it should be included in the data analysis. One
way of dealing with extreme outliers is to do appropriate analyses both
with and without the outlier included and report
both sets of results.
- A boxplot of the ACT scores is shown below. There are
no suspected outliers in this distribution, but if there were they would
be denoted by asterisks apart from the whiskers (and then the whiskers
would extend to the largest—or smallest—value that isn’t an outlier).
Note the boxplot also shows the left-skewed
property of the distribution of scores.
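As a rough check of the five-number summary and the 1.5 × IQR rule worked out in the parts above, here is a minimal Python sketch. The helper name value_at_position is hypothetical (it is not part of the course materials), and the sketch assumes the (n+1)p position rule with linear interpolation used in these solutions. Other software may use a different quartile convention and report slightly different quartiles.

```python
import math

# ACT scores read off the stem-and-leaf plots (n = 20).
scores = [22, 23, 24, 27, 27, 28, 28, 29, 29, 29,
          30, 30, 30, 30, 31, 31, 31, 31, 32, 32]


def value_at_position(sorted_data, p):
    """Value at the (n + 1) * p position, with linear interpolation.

    Hypothetical helper; assumes 1 <= (n + 1) * p <= n, which holds
    for p = 0.25, 0.5, 0.75 when n = 20.
    """
    n = len(sorted_data)
    position = (n + 1) * p           # 1-based, possibly fractional
    lower = math.floor(position)     # position of the lower neighbor
    fraction = position - lower      # how far to move toward the upper neighbor
    upper = min(lower + 1, n)
    low_value = sorted_data[lower - 1]
    high_value = sorted_data[upper - 1]
    return low_value + fraction * (high_value - low_value)


data = sorted(scores)
q1 = value_at_position(data, 0.25)
median = value_at_position(data, 0.50)
q3 = value_at_position(data, 0.75)
five_number = (data[0], q1, median, q3, data[-1])

# 1.5 * IQR rule for suspected outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Five-number summary:", five_number)
print("Fences:", lower_fence, upper_fence)
print("Suspected outliers:", outliers)
```

The printed values can be compared with the hand calculations above; an empty outlier list corresponds to "no suspected outliers."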

-
- The
average age of all residents, 72 years, is a parameter, because it’s a
number that describes a population. The average age of the 50 residents
in the sample, 64 years, is a statistic, as it’s a number that describes
a sample.
- The
sample of 50 residents is certainly not a random sample from the
population (anyone not at the health club was excluded from the sampling
process, so not every resident had an equal chance of being selected).
Most likely, the residents using the gym are in better health than the
general population, and therefore may be younger than the general
population. Hence, the sample average age is probably not a good estimate
of the population average age in this case (the estimate is biased low,
because of the flaw in the sampling process).
- For
parts a – c, it’s important to draw a picture as part of the solution
(these solutions don’t include pictures, simply because Word can’t draw
them). We will study normal distributions in much more detail later in the
term.
- The
value of 395 is one standard deviation below the mean. Since 68% of the
observations are within one standard deviation of the mean, 32% are
outside this range. Then by symmetry of a mound-shaped (i.e., bell-shaped, normal)
distribution, 32/2=16% of the observations are below 395.
- By
symmetry and the Empirical Rule, 95/2 = 47.5% of the observations are
between the values 505 and 725, and 68/2=34% of the observations are
between 505 and 615. Then 47.5 – 34 = 13.5% of the observations are
between 615 and 725.
- Bubba
scored a 405, and his z-score is
z = (405 – 505)/110 = –0.91. In words, this means Bubba’s score is 0.91 standard
deviations below the mean score. Hence, his score is not unusual (by the
Empirical Rule, we know more than 32% of the other scores are more
extreme than Bubba’s).
- It’s
important to remember the Empirical Rule only applies to mound-shaped
distributions. In this case, we don’t know the distribution is mound
shaped, so we should use Tchebysheff’s Rule to
determine the percentage of observations within two standard deviations
of the mean: at least
1 – 1/2² = 3/4, or 75%, of the scores are between 285 and 725.
Note: Both the Empirical Rule and Tchebysheff’s
Rule tell us it’s unusual (although certainly not impossible) to have an observed
value more than 3 standard deviations from the mean (and the Empirical Rule
tells us it’s even unusual to have an observation more than 2 standard
deviations from the mean).
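The arithmetic in the parts above can be double-checked with a short Python sketch. The mean of 505 and standard deviation of 110 are not stated outright in these solutions; they are inferred from the quoted values (395, 615, and 725 sit one or two standard deviations from 505). The function names are hypothetical, chosen only for illustration.

```python
# Assumed from the values quoted in the solutions: mean 505, SD 110.
MEAN = 505.0
SD = 110.0

# Empirical Rule percentages within k standard deviations of the mean
# (only appropriate for mound-shaped distributions).
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}


def z_score(x, mean=MEAN, sd=SD):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def tchebysheff_lower_bound(k):
    """At least this proportion lies within k SDs of the mean (k > 1)."""
    return 1 - 1 / k**2


# Percentage below 395 (one SD below the mean), by symmetry.
below_395 = (100 - WITHIN[1]) / 2

# Percentage between 615 and 725 (one to two SDs above the mean).
between_615_725 = WITHIN[2] / 2 - WITHIN[1] / 2

print("Below 395:", below_395, "%")
print("Between 615 and 725:", between_615_725, "%")
print("Bubba's z-score:", round(z_score(405), 2))
print("Tchebysheff bound for k = 2:", tchebysheff_lower_bound(2))
print("Two-SD interval:", MEAN - 2 * SD, "to", MEAN + 2 * SD)
```

Note that the Tchebysheff bound holds for any distribution shape, which is why it gives a weaker guarantee (at least 75%) than the Empirical Rule's roughly 95% within two standard deviations.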
- The
bar chart shows the raw counts of males and females who prefer each type of
peanut butter. There were more females (18) than males (8) in Math 117,
though, so the bars for females are almost always taller than the bars for
males. This creates a potentially misleading graph. It would be better to
have the bars show the percentage of males and females who prefer the different
types of peanut butter, as shown below.

-
- The
distributions are both symmetric and they both balance at the same point.
Since the mean is the balance point of the graphed distribution, the
means for both classes must be the same.
- The
standard deviation measures the spread of the distribution around the
mean. The scores for Section A are more concentrated around the mean,
while the scores for Section B are more spread out. Hence, the standard
deviation of the scores for Section B must be larger.
- In
this case, both sets of exam scores have the same range. So if we used
the range as a measure of spread, we would think the two sets of exam
scores had the same amount of spread. Visually (and as measured by the
standard deviation), it seems obvious that the scores for Section B have
more spread. Hence, the standard deviation is a better measure of spread.
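To make the comparison of range and standard deviation concrete, here is a small Python sketch. The two score lists are hypothetical, invented only for illustration (the actual exam scores are not given in these solutions); they are built to be symmetric with the same mean and the same range, like the two graphed sections.

```python
import statistics

# Hypothetical exam scores, NOT the real data: both lists are symmetric
# around 70 and run from 50 to 90.
section_a = [50, 70, 70, 70, 70, 70, 90]   # concentrated near the mean
section_b = [50, 50, 60, 70, 80, 90, 90]   # more spread out

for name, scores in (("A", section_a), ("B", section_b)):
    mean = statistics.mean(scores)
    score_range = max(scores) - min(scores)
    sd = statistics.stdev(scores)           # sample standard deviation
    print(f"Section {name}: mean={mean}, range={score_range}, sd={sd:.2f}")
```

In this made-up example the range cannot tell the two sections apart, while the standard deviation is larger for the more spread-out list.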
- This
is a graph of the monetary amount (in dollars) of carried coin money. This
variable has a natural boundary, as someone cannot carry a negative amount
of coins. Furthermore, most people carry no coins or only a small number
of coins, while a few people carry a lot of coins. So the tail of the
distribution can extend out to the right, but it can’t extend to the left
(because of the $0 boundary).
Why are the other answers
incorrect?
·
The number of hours of sleep on a typical weeknight does
have natural boundaries on both sides (0 on the low end and practically
speaking, say 12, on the high end), but it’s not clear that most of the
observations would clump at one of those boundaries. It makes more sense for
the distribution of this variable to be much less skewed than the graph shown.
·
If students truly choose a random integer, then
the distribution of values should be uniform, rather than severely skewed.
Humans are incapable, though, of choosing a random number, hence it’s very
possible the distribution wouldn’t be uniform. That said, a huge majority of
students probably wouldn’t choose a low number (e.g., 0) as their “random” number. Typically, the most common
“random” integer chosen by students is 7.
·
Heights of people typically follow a normal
distribution (there are natural boundaries for height, but the typical values
of height fall far enough away from the boundaries that tails can form on both
sides of the distribution). Since female and male heights were both recorded,
it’s possible the graph of the height variable would be bi-modal. Neither of
these shapes is displayed in the graph shown.
- There
are many errors with Bubba’s stem-and-leaf plot:
·
The leaves are not ordered.
·
No leaf unit is given.
·
The “5” stem is not included.
·
The graph isn’t titled.
- Bubba
should not create a histogram of home states, as this is a categorical
variable. It only makes sense to create a histogram if the variable is
quantitative (e.g., GPA or
height). The home states could be shown in a bar chart, but not a
histogram.
- Listed
below are the reasons why each statement is correct or incorrect:
·
The median
maximum speed for steel roller coasters is 50 mph. Because the distribution is
obviously skewed right, the mean will actually be larger than 50 mph (but we
can’t know the exact value from the boxplot).
·
Because the first quartile is closer to the
median than the third quartile is (and there are the same number of
observations in the range from the first quartile to the median
as in the range from the median to the third quartile), we know the observations are more densely packed at the
low end of the distribution and more spread out at the high end of the
distribution. Hence, the distribution is skewed
right. (We can also see this from the long right tail of the boxplot.)
·
This statement is correct, and can be seen by
simply comparing the vertical lines in the boxes.
·
The boxplots give no
information about how many coasters of each type were sampled (they only show
percentiles). Hence, we cannot tell whether
there are more steel coasters.
·
The median maximum speed for the steel coasters
is 50 mph, so 50% of these coasters have maximum speeds above 50 mph. The first
quartile for the wooden coaster speeds is 50 mph, so 75% of these coasters have
maximum speeds above 50 mph. Hence, a lower
percentage of steel coasters have maximum speeds above 50 mph.
- The
distribution of a random sample of incomes from the United States
is most likely skewed right. There is a natural boundary of $0 as an
income (people can’t make a negative amount of money). Also, many people
make a small or moderate amount of money, yet a very few make an obscenely
large amount of money, which creates a long right tail (because of the
boundary at $0, there can’t be an equivalent left tail). Because the
distribution is skewed right, the mean income will be greater than the median income (the mean is pulled up by the
long right tail). Furthermore, the five-number
summary is a better numerical summary for skewed distributions. (The
mean and standard deviation should be used for symmetric distributions.)
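The effect of a long right tail on the mean can be illustrated with a brief Python sketch. The incomes below are hypothetical values in thousands of dollars, made up only to mimic a right-skewed sample; they are not real data.

```python
import statistics

# Hypothetical incomes in thousands of dollars (NOT real data):
# most values are small or moderate, and one is very large.
incomes = [18, 25, 30, 32, 38, 41, 45, 52, 60, 400]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print("Mean income:", mean_income)
print("Median income:", median_income)
print("Mean exceeds median:", mean_income > median_income)
```

The single large value pulls the mean well above the median, which is why the median (and the five-number summary more generally) describes a skewed distribution better.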