Math 217 Homework 1 Solutions
1.129
Included below is a histogram of the biological-clock cycle lengths. Also included are the numerical summaries for the same variable.

Descriptive Statistics: Bio. Clock Length (in hours)
Variable                      N    Mean  StDev  Minimum      Q1  Median      Q3  Maximum    IQR
Bio. Clock Length (in hrs)  149  24.339  0.924   22.000  23.735  24.310  24.845   28.550  1.110
The distribution of biological-clock cycle lengths for the Arabidopsis plant is approximately symmetric around 24.3 hours (the mean and median are both about 24.3 hours). The standard deviation is 0.924 hours, which is a fairly small spread (less than 1 hour). There are, though, outliers in the data set (these are clearly seen in the boxplot below). Based on the description in the textbook, I would guess the high outliers all come from the same location (either north or south).
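The presence of outliers can also be checked from the numerical summary alone. The sketch below assumes the usual 1.5 x IQR rule for flagging outliers (an assumption here; the solution itself only points to the boxplot) and uses the quartiles, minimum, and maximum from the Minitab output.

```python
# Quartiles and extremes copied from the Minitab summary above.
q1, q3 = 23.735, 24.845
minimum, maximum = 22.000, 28.550

iqr = q3 - q1                     # interquartile range
lower_fence = q1 - 1.5 * iqr      # assumed 1.5 * IQR rule
upper_fence = q3 + 1.5 * iqr

# If the most extreme values fall outside the fences, outliers exist.
has_low_outlier = minimum < lower_fence
has_high_outlier = maximum > upper_fence

print(f"IQR = {iqr:.3f}")
print(f"fences = ({lower_fence:.3f}, {upper_fence:.3f})")
print(f"low outlier: {has_low_outlier}, high outlier: {has_high_outlier}")
```

Under this rule both the minimum and the maximum fall outside the fences, which agrees with the outliers described above.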

3.59
Children from larger
families are overrepresented in such a sample. For example, suppose there are
100 families with children—60 families have one child and 40 have three
children. Then there are a total of 180 children (an average of 1.8 per
family), and two-thirds (120/180) of these children come from families with
three children. Hence, in a sample (a class) of these children, about one-third
would answer “one” to the teacher’s question and about two-thirds would answer
“three” to the teacher’s question. This would give a sample average of 2.33
children per family (much higher than the true average of 1.8). Instead of
sampling children (who over-represent larger families), families should be sampled.
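The arithmetic in this example can be reproduced in a few lines. This is only a sketch of the hypothetical 100-family population described above; the variable names are illustrative.

```python
# Hypothetical population from the example: {children per family: number of families}.
families = {1: 60, 3: 40}

n_families = sum(families.values())
n_children = sum(size * count for size, count in families.items())

# Average over families (the quantity we actually want).
family_average = n_children / n_families

# Average over children: a family of size k is reported by k children,
# so its answer "k" is counted k times.
child_average = sum(size * size * count for size, count in families.items()) / n_children

# Share of all children who come from three-child families.
share_from_three = 3 * families[3] / n_children

print(f"children: {n_children}")
print(f"average per family: {family_average:.2f}")
print(f"average reported by children: {child_average:.2f}")
print(f"share from three-child families: {share_from_three:.3f}")
```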
3.60
Clearly, there are many
correct answers to this problem. Here are some example answers:
7.37
The stem-and-leaf plot of the recorded values for the 12 radon detectors is shown below.
Stem-and-Leaf Display: Radon Detection (pCi/l)
Stem-and-leaf of Radon Detection (pCi/l)  N = 12
Leaf Unit = 1.0
 9 | 1
 9 | 5 6 7 9
10 | 1 3 4
10 | 5
11 | 1
11 | 9
12 | 2
Now we calculate the test statistic (you can use Minitab to find the sample mean and standard deviation for these data): $t = \frac{\bar{x} - 105}{s / \sqrt{12}} = -0.321$. [Note: This tells us that our particular sample mean is 0.321 standard errors below the null-hypothesized population mean.]
Since this is a two-sided test, our P-value is doubled: $P = 2 P(T \ge 0.321)$, where T has a t-distribution with 11 degrees of freedom. Based on Table D, all we can say is that our P-value is greater than 0.5. [We can get the exact P-value from Minitab: 0.754.]
Hence,
assuming the mean detection level for the population of all radon detectors of
this type is 105 pCi/l,
there is more than a 50% chance of getting our particular sample average
reading or a more extreme reading. That is, our data are not at all unlikely,
and these results are not statistically significant at any reasonable
significance level. We have no evidence that the mean reading of all detectors
differs from 105 pCi/l.
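For readers without Minitab, the two-sided P-value can be approximated directly from the test statistic. The sketch below uses only the Python standard library and integrates the t density numerically with Simpson's rule; the function names `t_pdf` and `two_sided_p` are illustrative, not part of any library.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=1000):
    """Two-sided P-value 2 * P(T >= |t|), via Simpson's rule on [0, |t|]."""
    a, b = 0.0, abs(t)
    h = (b - a) / steps               # steps must be even for Simpson's rule
    total = t_pdf(a, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    central_area = total * h / 3      # P(0 <= T <= |t|)
    return 2 * (0.5 - central_area)

p_value = two_sided_p(-0.321, df=11)
print(f"two-sided P-value = {p_value:.3f}")
```

Because the statistic is entered rounded to three decimals, the result should land close to, but not necessarily exactly on, the Minitab value quoted above.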
10.4


Descriptive Statistics: Index of Biotic Integrity, Area of Watershed (km-sq)
Variable   N   Mean  StDev  Minimum     Q1  Median     Q3  Maximum    IQR
IBI       49  65.94  18.28    29.00  54.50   71.00  82.00    91.00  27.50
Area      49  28.29  17.71     2.00  15.00   26.00  36.50    70.00  21.50
The
IBI distribution of these streams is skewed to the left, whereas the area of
watershed is skewed to the right (so for each distribution, the 5-number
summary is the best numerical summary). In each case, there seems to be a
second (smaller) “mound” in the long tail. Perhaps there are two different
“types” of streams (maybe based on some other characteristic)?

Regression Analysis: Index of Biotic Integrity versus Area of Watershed
The regression equation is
Index of Biotic Integrity (IBI) = 52.9 + 0.460 Area of Watershed (km-sq)
Predictor                     Coef  SE Coef      T      P
Constant                    52.923    4.484  11.80  0.000
Area of Watershed (km-sq)   0.4602   0.1347   3.42  0.001
S = 16.5346   R-Sq = 19.9%   R-Sq(adj) = 18.2%
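Two entries in this output can be recomputed from the others, which is a quick way to check how the table fits together. The sketch below assumes the standard formulas (T = Coef / SE Coef, and the adjusted R-squared for a single predictor) and takes n = 49 from the descriptive statistics.

```python
import math

# Values copied from the Minitab regression output above.
n = 49                          # number of streams
coef, se_coef = 0.4602, 0.1347  # slope and its standard error
r_sq = 0.199

t_ratio = coef / se_coef                        # the T column for the slope
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - 2)   # adjusted R-sq, one predictor
r = math.sqrt(r_sq)                             # correlation (slope is positive)

print(f"t = {t_ratio:.2f}")
print(f"R-Sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"r = {r:.3f}")
```

A correlation of roughly 0.45 is a reminder that, although the slope is statistically significant, area explains only about a fifth of the variation in IBI.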



10.6
I must begin by reiterating that the model conditions (particularly the normality of the errors) do not seem to be met, so our confidence and prediction intervals might not be accurate. That said, the Minitab output is shown below.
Area of Watershed    Fit  SE Fit          95% CI           95% PI
30.0 km-sq         66.73    2.37  (61.95, 71.50)  (33.12, 100.33)
b.
A 95% prediction interval for the IBI of a new river with 30 km-sq of watershed is (33.12, 100.33). Note that this prediction interval is so wide that it is of little use: its upper end exceeds the largest IBI in our data (91), so we could have guessed nearly the same interval from the range of our original data (29 to 91).
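Both intervals in the Minitab output can be rebuilt from the fit, its standard error, and the regression standard error S = 16.5346 from Problem 10.4. The sketch below assumes the standard formulas and a t* of about 2.012 for 47 degrees of freedom (a table value, not something printed in the output).

```python
import math

fit, se_fit = 66.73, 2.37   # from the Minitab output for 30 km-sq
s = 16.5346                 # regression standard error from Problem 10.4
t_star = 2.012              # assumed table value: 95% confidence, 47 df

# Confidence interval for the mean response uses SE Fit alone.
ci_margin = t_star * se_fit
# Prediction interval adds the variability of a single new observation.
pi_margin = t_star * math.sqrt(s**2 + se_fit**2)

ci = (fit - ci_margin, fit + ci_margin)
pi = (fit - pi_margin, fit + pi_margin)

print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```

The prediction margin is dominated by S rather than SE Fit, which is why the prediction interval is so much wider than the confidence interval.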
10.15


Descriptive Statistics: Response to Pure Tone, Response to Monkey Call
Variable                  N   Mean  StDev  Minimum    Q1  Median     Q3  Maximum    IQR
Response to Pure Tone    37  106.2   91.8     19.0  38.0    72.0  155.5    474.0  117.5
Response to Monkey Call  37  176.6  111.8     42.0  91.0   141.0  205.5    500.0  114.5

Regression Analysis: Response to Monkey Call versus Response to Pure Tone
The regression equation is
Response to Monkey Call = 93.9 + 0.778 Response to Pure Tone
Predictor                Coef  SE Coef     T      P
Constant                93.92    22.12  4.25  0.000
Response to Pure Tone  0.7783   0.1586  4.91  0.000
S = 87.2968   R-Sq = 40.8%   R-Sq(adj) = 39.1%



Without the Large-Residual Observation
Regression Analysis: Response to Monkey Call versus Response to Pure Tone
The regression equation is
Response to Monkey Call_1 = 98.4 + 0.679 Response to Pure Tone_1
Predictor                  Coef  SE Coef     T      P
Constant                  98.42    20.52  4.80  0.000
Response to Pure Tone_1  0.6792   0.1513  4.49  0.000
S = 80.6894   R-Sq = 37.2%   R-Sq(adj) = 35.4%



When removing the large-residual observation, the regression output doesn’t change much. The residual plots look roughly the same. The slope is still significant and it didn’t change much (from 0.78 to 0.68). The R-squared value goes down (from 40.8% to 37.2%), but not by much.
Without the Extreme-Tone-Value
Regression Analysis: Response to Monkey Call versus Response to Pure Tone
The regression equation is
Response to Monkey Call_2 = 101 + 0.693 Response to Pure Tone_2
Predictor                  Coef  SE Coef     T      P
Constant                 101.10    25.53  3.96  0.000
Response to Pure Tone_2  0.6927   0.2176  3.18  0.003
S = 88.1351   R-Sq = 23.0%   R-Sq(adj) = 20.7%



When removing the extreme-tone-value observation, again the regression output doesn’t change much. The shape (approximately normal) of the residual distribution is the same. The residual plot does look different (since the extreme value was such a prominent feature in the previous residual plots). The slope is still significant and it didn’t change much (from 0.78 to 0.69). The only big difference is the drop in R-squared (from 40.8% to 23.0%), which is quite substantial. In this case, the extreme value made the correlation stronger.
Without Both the Large-Residual Observation and the Extreme-Tone-Value
Regression Analysis: Response to Monkey Call versus Response to Pure Tone
The regression equation is
Response to Monkey Call_3 = 116 + 0.466 Response to Pure Tone_3
Predictor                  Coef  SE Coef     T      P
Constant                 115.76    23.54  4.92  0.000
Response to Pure Tone_3  0.4656   0.2105  2.21  0.034
S = 79.4568   R-Sq = 12.9%   R-Sq(adj) = 10.3%



When removing both the large-residual and extreme-tone-value observations, the regression output changes quite a lot. The distribution of residuals looks less normal. The slope value (while still significant) changes quite a bit (from 0.78 to 0.47). Furthermore, the R-squared value plummets (from 40.8% to 12.9%). These two observations, in combination, have quite an effect on the regression analysis.
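The four fits are easier to compare side by side. The sketch below simply collects the slope, its standard error, and R-squared from the four Minitab outputs above and recomputes the t ratio and the correlation; the labels are illustrative.

```python
import math

# (label, slope, SE of slope, R-squared) copied from the four Minitab outputs.
fits = [
    ("all observations",       0.7783, 0.1586, 0.408),
    ("without large residual", 0.6792, 0.1513, 0.372),
    ("without extreme tone",   0.6927, 0.2176, 0.230),
    ("without both",           0.4656, 0.2105, 0.129),
]

summary = []
for label, slope, se, r_sq in fits:
    t_ratio = slope / se       # matches the T column
    r = math.sqrt(r_sq)        # slope is positive in every fit, so r > 0
    summary.append((label, t_ratio, r))
    print(f"{label:24s} slope={slope:.3f}  t={t_ratio:.2f}  r={r:.2f}")
```

Laid out this way, the correlation falls from about 0.64 with all observations to about 0.36 with both points removed, while the t ratio drops by more than half.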
10.21

Regression Analysis: Wages (income/days worked) versus Length of Service (in months)
The regression equation is
Wages (income/days worked) = 43.4 + 0.0733 Length of Service (in months)
Predictor                         Coef  SE Coef      T      P
Constant                        43.383    2.248  19.30  0.000
Length of Service (in months)  0.07325  0.02571   2.85  0.006
S = 10.2131   R-Sq = 12.5%   R-Sq(adj) = 10.9%
10.34
The residual plot from the regression in Problem 10.21 is shown below. The size of the bank (L=large, S=small) at which each employee works is marked on the residual plot. Notice, particularly in the left part of the graph, that there is a clumping of large-bank residuals above the 0-line and a clumping of small-bank residuals below the 0-line. This indicates that our regression line tends to underestimate wages for employees at large banks and overestimate wages for employees at small banks.

10.44

Females Only, Regression Analysis: Metabolic Rate (in calories) versus Lean Body Mass (in kilograms)
The regression equation is
Metabolic Rate (in calories) = 201 + 24.0 Lean Body Mass (in kilograms)
Predictor                        Coef  SE Coef     T      P
Constant                        201.2    181.7  1.11  0.294
Lean Body Mass (in kilograms)  24.026    4.174  5.76  0.000
S = 95.0808   R-Sq = 76.8%   R-Sq(adj) = 74.5%



The regression output and residual plots for males only are shown below. Again, the normality condition seems to be violated, which makes inference on the slope questionable. It is noticeable that the slope is no longer significant for the males (even though it was for females), but we can’t count on the accuracy of this inference. But violations of our model conditions don’t impact our interpretation of the slope coefficients and R-squared values. The slope is quite different for males (for each additional kilogram of lean body mass, a man is predicted to gain 16.8 calories in metabolic rate). Furthermore, the R-squared value is much lower for males (35.1%, as compared to 76.8% for females).
Males Only, Regression Analysis: Metabolic Rate (in calories) versus Lean Body Mass (in kilograms)
The regression equation is
Metabolic Rate (in calories) = 711 + 16.8 Lean Body Mass (in kilograms)
Predictor                       Coef  SE Coef     T      P
Constant                       710.5    545.1  1.30  0.249
Lean Body Mass (in kilograms)  16.75    10.20  1.64  0.161
S = 167.062   R-Sq = 35.1%   R-Sq(adj) = 22.1%



10.45
There are 12 females, so the regression has n - 2 = 10 degrees of freedom for the t-distribution, and for 95% confidence, $t^* = 2.228$. So for the females, a 95% confidence interval for the slope is $24.026 \pm (2.228)(4.174)$, or (14.726, 33.326).
There are 7 males, so the regression has n - 2 = 5 degrees of freedom for the t-distribution, and for 95% confidence, $t^* = 2.571$. So for the males, a 95% confidence interval for the slope is $16.75 \pm (2.571)(10.20)$, or (-9.47, 42.97).
Notice
that the two confidence intervals overlap (i.e., share common, likely values).
Hence, we do not have evidence that the population slopes are actually
different.
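The interval calculation and the overlap check can be sketched as follows. This sketch uses n - 2 residual degrees of freedom for each regression (the degrees of freedom behind the P-values in the Minitab regression output), and the t* values are assumed from a standard t table rather than taken from the output; `slope_ci` is an illustrative helper name.

```python
def slope_ci(coef, se_coef, t_star):
    """Confidence interval coef +/- t_star * se_coef for a regression slope."""
    margin = t_star * se_coef
    return coef - margin, coef + margin

# t* values are assumed table values for 95% confidence with n - 2 df.
females = slope_ci(24.026, 4.174, t_star=2.228)   # n = 12, df = 10
males = slope_ci(16.75, 10.20, t_star=2.571)      # n = 7,  df = 5

# Two intervals overlap when each one starts before the other ends.
overlap = females[0] <= males[1] and males[0] <= females[1]

print(f"females: ({females[0]:.3f}, {females[1]:.3f})")
print(f"males:   ({males[0]:.2f}, {males[1]:.2f})")
print(f"intervals overlap: {overlap}")
```

The male interval is much wider and contains the entire female interval, so the overlap conclusion does not hinge on the exact t* used.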
For males, $\sqrt{\sum (x_i - \bar{x})^2} = s / SE_{b_1} = 167.062 / 10.20 \approx 16.38$.
This
quantity is in the denominator of the standard error for the estimated slope.
Hence, if the quantity is made larger, the standard error decreases (which
makes it easier to detect a significant slope).
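Since the standard error of the slope is $SE_{b_1} = s / \sqrt{\sum (x_i - \bar{x})^2}$, the denominator can be recovered from the Minitab output as s divided by the slope's standard error. The sketch below does this for both sexes, using only the S and SE Coef values printed in Problem 10.44; the dictionary layout is illustrative.

```python
# s (regression standard error) and SE of the slope, from the Minitab outputs.
outputs = {
    "females": {"s": 95.0808, "se_slope": 4.174},
    "males":   {"s": 167.062, "se_slope": 10.20},
}

# SE_b1 = s / sqrt(sum((x - xbar)^2)), so the denominator is s / SE_b1.
denominator = {sex: v["s"] / v["se_slope"] for sex, v in outputs.items()}

for sex, value in denominator.items():
    print(f"{sex}: sqrt(sum of squared deviations in lean body mass) = {value:.2f}")
```

The males have both the smaller denominator and the larger s, and each of these pushes their slope's standard error up relative to the females'.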