Math 217 Homework 1 Solutions

 

1.129

Included below is a histogram of the biological-clock cycle lengths. Also included are the numerical summaries for the same variable.

 

 

Descriptive Statistics: Bio. Clock Length (in hours)

Variable                      N    Mean  StDev  Minimum      Q1  Median      Q3  Maximum    IQR

Bio. Clock Length (in hrs)  149  24.339  0.924   22.000  23.735  24.310  24.845   28.550  1.110

 

The distribution of biological-clock cycle lengths for the Arabidopsis plant is approximately symmetric around the value 24.3 hours (both the mean and median value are 24.3 hours). The standard deviation is 0.924 hours. This is a fairly small spread (less than 1 hour). There are, though, outliers in the data set (these are clearly seen in the boxplot below). Based on the description in the textbook, I would guess the high outliers all come from the same location (either north or south).

 

 

 

3.59

Children from larger families are overrepresented in such a sample. For example, suppose there are 100 families with children—60 families have one child and 40 have three children. Then there are a total of 180 children (an average of 1.8 per family), and two-thirds (120/180) of these children come from families with three children. Hence, in a sample (a class) of these children, about one-third would answer “one” to the teacher’s question and about two-thirds would answer “three” to the teacher’s question. This would give a sample average of 2.33 children per family (much higher than the true average of 1.8). Instead of sampling children (who over-represent larger families), families should be sampled.
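The arithmetic in this example can be checked with a short script. This is only a sketch of the calculation described above; the variable names are illustrative, and the family counts are the hypothetical ones from the example.

```python
# Family-size example from the text: 60 one-child and 40 three-child families.
families = {1: 60, 3: 40}  # children per family -> number of families

total_families = sum(families.values())
total_children = sum(size * count for size, count in families.items())

# Sampling families: every family is counted once.
mean_per_family = total_children / total_families

# Sampling children: a family with k children is reported k times (once per child).
mean_reported_by_children = (
    sum(size * size * count for size, count in families.items()) / total_children
)

print(f"average when sampling families: {mean_per_family:.2f}")
print(f"average when sampling children: {mean_reported_by_children:.2f}")
```

The second average weights each family by its size, which is why it exceeds the per-family average.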

 

 

 

 

 

 

3.60

Clearly, there are many correct answers to this problem. Here are some example answers:

 

  a. Do you think the government should cut education funding, leaving students to buy their own textbooks, learn less material, and be exposed to less innovative teaching techniques? (This is geared toward a “no” answer.)

 

  b. In the future, say in 2 years, do you think you will not be involved in activities, such as volunteer groups, church groups, social groups, or do you think that you will become more involved, since you’ll want more stimulus, or will you focus more on your career? (This is a wordy, confusing, unclear question.)

 

 

7.37

The stem-and-leaf plot of the recorded values for the 12 radon detectors is shown below.

 

  a. The stemplot shows values piling up in the center, with a slightly longer right tail. Because the sample size is so small, it’s very important that the normality condition of the t-test be met (we check the normality of the population by looking at a graph of the sample data distribution). Since the skewness isn’t strong, we can consider these data roughly mound-shaped (i.e., the normality condition seems to be met).

 

Stem-and-Leaf Display: Radon Detection (pCi/l)

Stem-and-leaf of Radon Detection (pCi/l)  [N  = 12]

Leaf Unit = 1.0

 

 9| 1

 9| 5 6 7 9

10| 1 3 4

10| 5

11| 1

11| 9

12| 2

 

  b. Let $\mu$ be the mean detection level for the population of all radon detectors of this type. Then we want to test the hypotheses $H_0: \mu = 105$ versus $H_a: \mu \neq 105$. Since the population standard deviation is unknown, we must use a t-test. In part a we showed that the normality condition of the t-test seems to be met.

 

Now we calculate the test statistic (you can use Minitab to find the sample mean $\bar{x}$ and standard deviation $s$ for these data): $t = \dfrac{\bar{x} - 105}{s/\sqrt{12}} = -0.321$. [Note: This tells us that our particular sample mean is 0.321 standard errors below the null-hypothesized population mean.]

 

Since this is a two-sided test, our P-value is doubled: $P = 2\,P(T \le -0.321)$, where T has a t-distribution with 11 degrees of freedom. Based on Table D, all we can say is that our P-value is greater than 0.5. [We can get the exact P-value from Minitab: 0.754.]

 

Hence, assuming the mean detection level for the population of all radon detectors of this type is 105 pCi/l, there is more than a 50% chance of getting our particular sample average reading or a more extreme reading. That is, our data are not at all unlikely, and these results are not statistically significant at any reasonable significance level. We have no evidence that the mean reading of all detectors differs from 105 pCi/l.
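For readers without Minitab, the two-sided P-value can be approximated directly from the t density. The sketch below uses only the Python standard library and integrates the density with Simpson's rule; the function names and the number of integration steps are illustrative choices, not part of the original solution.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=1000):
    """Two-sided P-value, 1 - P(-|t| < T < |t|), using Simpson's rule."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # approximates P(0 < T < |t|)
    return 1 - 2 * central_half

# Test statistic and degrees of freedom from the solution above.
p_value = two_sided_p(-0.321, df=11)
print(f"two-sided P-value: {p_value:.3f}")
```

The result should agree with the Minitab value quoted above to about three decimal places.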

 

 

10.4

  a. Histograms and numerical summaries are shown below for both the index of biotic integrity (IBI) and area of watershed for the 49 streams in the Ozark Highland Ecoregion.

 

 

Descriptive Statistics: Index of Biotic Integrity, Area of Watershed (km-sq)

Variable     N    Mean   StDev   Minimum      Q1   Median      Q3  Maximum    IQR

IBI         49   65.94   18.28     29.00   54.50    71.00   82.00    91.00  27.50

Area        49   28.29   17.71      2.00   15.00    26.00   36.50    70.00  21.50

 

The IBI distribution of these streams is skewed to the left, whereas the area of watershed is skewed to the right (so for each distribution, the 5-number summary is the best numerical summary). In each case, there seems to be a second (smaller) “mound” in the long tail. Perhaps there are two different “types” of streams (maybe based on some other characteristic)?

 

  b. The question does not make the response variable clear, but I assume we’re interested in predicting water quality (IBI) based on area of watershed. Included below is a scatterplot of these two variables for the 49 streams in the study. There appears to be a fairly weak, positive, linear pattern between these two variables (the correlation is 0.446). Also, the variability in IBI increases as the area of watershed decreases.

 

 

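  c. The statistical (population) model for this simple linear regression is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $i = 1, \dots, 49$ and the $\varepsilon_i$ are independent $N(0, \sigma)$ random variables.
  d. We want to test the following hypotheses about the population slope: $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$.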

 

  e. Included below are the regression results from Minitab. The slope coefficient tells us that for each increase of 10 km-sq of watershed area, we predict an increase of 4.6 in the index of biotic integrity. The P-value associated with the significance test in part d is 0.001. Hence, we have very strong evidence of a statistically significant linear relationship between watershed area and water quality (IBI). But these results are only valid if the conditions of our model (normality and constant variance) have been met. Additionally, the low R-squared value (19.9%) tells us that our model explains only about 20% of the variability in IBI (this is quite low).

 

Regression Analysis: Index of Biotic Integrity versus Area of Watershed

The regression equation is

Index of Biotic Integrity (IBI) = 52.9 + 0.460 Area of Watershed (km-sq)

 

Predictor                    Coef  SE Coef      T      P

Constant                   52.923    4.484  11.80  0.000

Area of Watershed (km-sq)  0.4602   0.1347   3.42  0.001

 

S = 16.5346   R-Sq = 19.9%   R-Sq(adj) = 18.2%
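The numbers quoted in parts b and e follow from the Minitab output by simple arithmetic. The snippet below is a small check, not part of the original solution; the variable names are illustrative, and all inputs are copied from the output above.

```python
import math

# Values copied from the Minitab output above.
slope, se_slope = 0.4602, 0.1347
r_squared = 0.199

t_stat = slope / se_slope      # test statistic for H0: slope = 0
change_per_10 = 10 * slope     # predicted change in IBI per 10 km-sq of area
r = math.sqrt(r_squared)       # correlation; positive root because the slope is positive

print(f"t = {t_stat:.2f}, change per 10 km-sq = {change_per_10:.1f}, r = {r:.3f}")
```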

 

  f. The residual plot is shown below. There is a slight megaphone shape to the residual plot (more variability in the residuals for smaller values of area). Hence, it seems the constant-variance condition of our regression model might not be met.

 

 

  g. Both a histogram and normality plot of the residuals are shown below. There is a right-skew evident in the residuals. Hence, it appears the normality condition is not met, either.

 

 

  h. Based on my explanations in parts f and g, it seems that neither the normality nor the constant-variance conditions on the errors have been met (although we could possibly argue that constant-variance is mostly met). Hence, we cannot trust the inference done with this model. That is, our test results in part e are not valid.

 

 

10.6

I must begin by reiterating that the model conditions (particularly the normality of the errors) do not seem to be met, so our confidence and prediction intervals might not be accurate. That said, the Minitab output is shown below.

 

Area of Watershed      Fit        SE Fit           95% CI                95% PI

30.0 km-sq           66.73         2.37   (61.95, 71.50)       (33.12, 100.33)

 

  a. A 95% confidence interval for the mean IBI for a river with 30 km-sq of watershed is (61.95, 71.50).

 

  b. A 95% prediction interval for the IBI of a new river with 30 km-sq of watershed is (33.12, 100.33). Note this prediction interval is so wide that it is of no use (we could have made this interval of guesses based on the range of our original data).

 

  c. Part a finds a confidence interval for a mean IBI of many rivers with 30 km-sq of watershed. That is, if river sampling (for rivers with 30 km-sq) were done repeatedly, then 95% of the intervals created using our method would contain the true mean IBI value. Part b finds a prediction interval for a single IBI value for a new river with 30 km-sq of watershed.

 

  d. It seems like location probably plays a role in the relationship between water quality and watershed. Hence, it’s best not to generalize these results beyond streams in the Ozark Highlands. Instead, we should collect data in other areas and see what the relationship is there.
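The two intervals in the Minitab output can be rebuilt by hand from the fitted line, the SE of the fit, and the regression standard error S from Problem 10.4. The sketch below does this; the multiplier 2.012 is an assumed approximation to the 97.5th percentile of a t-distribution with 47 degrees of freedom (it is not printed in the output), and the variable names are illustrative.

```python
import math

intercept, slope = 52.923, 0.4602  # fitted coefficients (Problem 10.4 output)
s = 16.5346                        # regression standard error S
se_fit = 2.37                      # SE of the fitted mean at 30 km-sq
t_star = 2.012                     # ASSUMED: approx. 97.5th percentile of t with 47 df

x_new = 30.0
fit = intercept + slope * x_new

# Confidence interval for the mean response uses SE Fit alone.
ci_margin = t_star * se_fit
# Prediction interval adds the variability of a single new observation.
se_pred = math.sqrt(s**2 + se_fit**2)
pi_margin = t_star * se_pred

ci = (fit - ci_margin, fit + ci_margin)
pi = (fit - pi_margin, fit + pi_margin)
print(f"fit = {fit:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```

Small differences from the Minitab intervals in the second decimal place are expected, since the inputs here are rounded. The prediction interval is much wider because S is about seven times larger than SE Fit.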

 

 

10.15

  a. Histograms and numerical summaries for both variables are shown below. Both distributions of neuron responses are skewed right with high outliers. The median neuron response to monkey calls (141 electrical spikes per second) is much higher than the median response to pure tones (72 electrical spikes per second). This doesn’t seem surprising.

 

 

Descriptive Statistics: Response to Pure Tone, Response to Monkey Call

Variable                  N   Mean  StDev  Minimum    Q1  Median     Q3  Maximum    IQR

Response to Pure Tone    37  106.2   91.8     19.0  38.0    72.0  155.5    474.0  117.5

Response to Monkey Call  37  176.6  111.8     42.0  91.0   141.0  205.5    500.0  114.5

 

  b. The scatterplot (with regression line) is shown below. The relationship is positive and somewhat linear (correlation of 0.64). The diamond represents the monkey with the largest residual and the square represents the monkey with the most extreme tone response.

 

 

  c. For all 37 observations, the regression output and residual plots are shown below. The residuals look roughly normal and show moderately constant variance (the residual plot clearly shows the x-outlier). The slope coefficient tells us that for each increase of 1 electrical spike per second in the response to pure tones, we predict an increase of 0.778 electrical spikes per second in the response to monkey calls. The P-value associated with the slope is reported as 0.000 (that is, less than 0.0005). Hence, we have very strong evidence of a statistically significant linear relationship between the response to pure tones and the response to monkey calls. Additionally, the fairly low R-squared value (40.8%) tells us that our model explains only about 41% of the variability in responses to monkey calls.

 

Regression Analysis: Response to Monkey Call versus Response to Pure Tone

               

The regression equation is

Response to Monkey Call = 93.9 + 0.778 Response to Pure Tone

 

Predictor                Coef  SE Coef     T      P

Constant                93.92    22.12  4.25  0.000

Response to Pure Tone  0.7783   0.1586  4.91  0.000

 

S = 87.2968   R-Sq = 40.8%   R-Sq(adj) = 39.1%

 

 

  d. We can analyze the effect of the outliers in three ways: 1) remove the large-residual observation only, 2) remove the extreme-value observation only, and 3) remove both observations. Then we can compare each of these new models to the full-data model described in part c.

 

Without the Large-Residual Observation

Regression Analysis: Response to Monkey Call versus Response to Pure Tone

 

The regression equation is

Response to Monkey Call_1 = 98.4 + 0.679 Response to Pure Tone_1

 

Predictor                  Coef  SE Coef     T      P

Constant                  98.42    20.52  4.80  0.000

Response to Pure Tone_1  0.6792   0.1513  4.49  0.000

 

S = 80.6894   R-Sq = 37.2%   R-Sq(adj) = 35.4%

 

 

When removing the large-residual observation, the regression output doesn’t change much. The residual plots look roughly the same. The slope is still significant and it didn’t change much (from 0.78 to 0.68). The R-squared value goes down (from 40.8% to 37.2%), but not by much.

 

 

Without the Extreme-Tone-Value

Regression Analysis: Response to Monkey Call versus Response to Pure Tone

 

The regression equation is

Response to Monkey Call_2 = 101 + 0.693 Response to Pure Tone_2

 

Predictor                  Coef  SE Coef     T      P

Constant                 101.10    25.53  3.96  0.000

Response to Pure Tone_2  0.6927   0.2176  3.18  0.003

 

S = 88.1351   R-Sq = 23.0%   R-Sq(adj) = 20.7%

 

     

 

When removing the extreme-tone-value observation, again the regression output doesn’t change much. The shape (approximately normal) of the residual distribution is the same. The residual plot does look different (since the extreme value was such a prominent feature in the previous residual plots). The slope is still significant and it didn’t change much (from 0.78 to 0.69). The only big difference is the drop in R-squared (from 40.8% to 23.0%), which is quite substantial. In this case, the extreme value made the correlation stronger.

 

 

Without Both the Large-Residual Observation and the Extreme-Tone-Value

Regression Analysis: Response to Monkey Call  versus Response to Pure Tone

 

The regression equation is

Response to Monkey Call_3 = 116 + 0.466 Response to Pure Tone_3

 

Predictor                  Coef  SE Coef     T      P

Constant                 115.76    23.54  4.92  0.000

Response to Pure Tone_3  0.4656   0.2105  2.21  0.034

 

S = 79.4568   R-Sq = 12.9%   R-Sq(adj) = 10.3%

 

 

When removing both the large-residual and extreme-tone-value observations, the regression output changes quite a lot. The distribution of residuals looks less normal. The slope value (while still significant) changes quite a bit (from 0.78 to 0.47). Furthermore, the R-squared value plummets (from 40.8% to 12.9%). These two observations, in combination, have quite an effect on the regression analysis.
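The four fits are easier to compare side by side. The sketch below tabulates the slope, the implied correlation (the positive square root of R-squared, since every slope is positive), and the percentage change in slope relative to the full-data fit. The labels and variable names are illustrative; the numbers are copied from the Minitab outputs above.

```python
import math

# (slope, R-squared) from the four Minitab fits above.
fits = {
    "all 37 observations": (0.7783, 0.408),
    "without large residual": (0.6792, 0.372),
    "without extreme tone value": (0.6927, 0.230),
    "without both": (0.4656, 0.129),
}

base_slope, _ = fits["all 37 observations"]
summary = {}
for label, (slope, r_sq) in fits.items():
    r = math.sqrt(r_sq)  # all slopes are positive, so r is the positive root
    pct_change = 100 * (slope - base_slope) / base_slope
    summary[label] = (round(r, 2), round(pct_change, 1))
    print(f"{label:28s} slope={slope:.3f}  r={r:.2f}  slope change={pct_change:+.1f}%")
```

Dropping either observation alone moves the slope by a little over 10%, while dropping both moves it by roughly 40%, which matches the qualitative description above.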

 

 

10.21

  a. The scatterplot of wages and length of service is shown below. The relationship is positive and moderately linear. The circled outlier will be removed for the rest of the analysis.

 

 

  b. The regression output is shown below. It would be inappropriate to do inference if our model conditions are violated, but a look at the residuals (not shown here) indicates that both the normality and constant-variance conditions of the errors are roughly met. The P-value for the significance test of the slope is 0.006. This means that if the population slope is actually 0, there is only a 0.006 chance of getting our particular sample slope value or a more extreme slope value. Because our data would be so unlikely under that assumption, we have strong evidence against the population slope equaling 0. That is, we have strong evidence that there is a linear relationship between wages and length of service. (The R-squared for our model, though, is very low, so it’s not a good predictive model.)

 

Regression Analysis: Wages (income/days worked)  versus Length of Service (in months)

 

The regression equation is

Wages (income/days worked) = 43.4 + 0.0733 Length of Service (in months)

 

Predictor                           Coef  SE Coef      T      P

Constant                          43.383    2.248  19.30  0.000

Length of Service (in months)    0.07325  0.02571   2.85  0.006

 

S = 10.2131   R-Sq = 12.5%   R-Sq(adj) = 10.9%

 

  c. For each additional month a female bank employee works, her predicted wages increase by 0.07 units (income/days worked).

 

  d. The sample size is 60, so the t-distribution for inference about the slope has n − 2 = 58 degrees of freedom (this isn’t listed in Table D, so we can simply use the nearby 60 row). The multiplier is 2.000. Hence, a 95% confidence interval for the slope is 0.07325 ± 2.000(0.02571), or (0.022, 0.125).
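The slope's t statistic and confidence interval come straight from the coefficient and its standard error. The snippet below is a small check using the Table D multiplier quoted in part d; the variable names are illustrative.

```python
coef, se = 0.07325, 0.02571  # slope and its standard error from the output above
t_star = 2.000               # Table D multiplier quoted in part d

t_stat = coef / se           # test statistic for H0: slope = 0
margin = t_star * se
ci = (coef - margin, coef + margin)

print(f"t = {t_stat:.2f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the interval excludes 0, it agrees with the significance test in part b.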

 

 

10.34

The residual plot from the regression in Problem 10.21 is shown below. The size of the bank (L=large, S=small) at which each employee works is marked on the residual plot. Notice, particularly in the left-part of the graph, there is a clumping of large-bank residuals above the 0-line and a clumping of small-bank residuals below the 0-line. This indicates that our regression line tends to underestimate wages for employees at large banks and overestimate wages for employees at small banks.

 

 

 

10.44

  a. The scatterplot of metabolic rate and lean body mass (separately for males and females) is shown below. Overall (with all 19 subjects), the relationship appears positive and strongly linear. Looking separately at males and females, it appears the relationship between these variables is more strongly linear for females.

 

 

 

  b. The regression output and residual plots for females only are shown below. The residual plots show a clear violation of the normality condition. Hence, any inference we do is questionable. The R-squared value is fairly high (76.8%), so this model explains a large amount of the variation in metabolic rates. The slope indicates that for each additional kilogram of lean body mass, a woman is predicted to gain 24 calories in metabolic rate.

 

Females Only, Regression Analysis: Metabolic Rate (in calories) versus Lean Body Mass (in kilograms)

 

The regression equation is

Metabolic Rate (in calories) = 201 + 24.0 Lean Body Mass (in kilograms)

 

Predictor                        Coef  SE Coef     T      P

Constant                        201.2    181.7  1.11  0.294

Lean Body Mass (in kilograms)  24.026    4.174  5.76  0.000

 

S = 95.0808   R-Sq = 76.8%   R-Sq(adj) = 74.5%

 

 

           

The regression output and residual plots for males only are shown below. Again, the normality condition seems to be violated, which makes inference on the slope questionable. It is noticeable that the slope is no longer significant for the males (even though it was for females), but we can’t count on the accuracy of this inference. But violations of our model conditions don’t impact our interpretation of the slope coefficients and R-squared values. The slope is quite different for males (for each additional kilogram of lean body mass, a man is predicted to gain 16.8 calories in metabolic rate). Furthermore, the R-squared value is much lower for males (35.1%, as compared to 76.8% for females).

 

Males Only, Regression Analysis: Metabolic Rate (in calories) versus Lean Body Mass (in kilograms)

 

The regression equation is

Metabolic Rate (in calories) = 711 + 16.8 Lean Body Mass (in kilograms)

 

Predictor                       Coef  SE Coef     T      P

Constant                       710.5    545.1  1.30  0.249

Lean Body Mass (in kilograms)  16.75    10.20  1.64  0.161

 

S = 167.062   R-Sq = 35.1%   R-Sq(adj) = 22.1%

 

 

 

10.45

  a. This problem asks us to determine confidence intervals, even though it appears the normality condition of our regression model is violated. Hence, the confidence level (95%) might be inaccurate.

 

There are 12 females, so there are $n - 2 = 10$ degrees of freedom for the t-distribution, and for 95% confidence, $t^* = 2.228$. So for the females, a 95% confidence interval for the slope is $24.026 \pm 2.228(4.174)$, or (14.73, 33.33).

There are 7 males, so there are $n - 2 = 5$ degrees of freedom for the t-distribution, and for 95% confidence, $t^* = 2.571$. So for the males, a 95% confidence interval for the slope is $16.75 \pm 2.571(10.20)$, or (-9.47, 42.97).

 

Notice that the two confidence intervals overlap (i.e., share common, likely values). Hence, we do not have evidence that the population slopes are actually different.

 

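  b. For females, $\sqrt{\sum (x_i - \bar{x})^2} = s / SE_{b_1} = 95.0808 / 4.174 \approx 22.8$.

For males, $\sqrt{\sum (x_i - \bar{x})^2} = s / SE_{b_1} = 167.062 / 10.20 \approx 16.4$.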

 

This quantity is in the denominator of the standard error for the estimated slope. Hence, if the quantity is made larger, the standard error decreases (which makes it easier to detect a significant slope).

 

  c. From part b it’s clear that we not only want as many observations as possible, but we also want as large a spread in the x-variable as possible. Hence, we should collect data on males with a larger range of lean body mass (not just at the high end).
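To see how much the spread in x matters, the sketch below backs the spread out of the Problem 10.44 outputs using the relation SE of slope = s / sqrt(sum of squared x-deviations), and then asks a purely hypothetical question: what would the males' standard error be if their residual standard deviation stayed the same but their lean body masses were as spread out as the females'? The variable names and the what-if scenario are illustrative assumptions, not part of the original solution.

```python
# (s, SE of slope) from the females-only and males-only outputs in Problem 10.44.
groups = {"females": (95.0808, 4.174), "males": (167.062, 10.20)}

# SE_b1 = s / sqrt(sum((x - xbar)^2)), so the root sum of squares is s / SE_b1.
root_sxx = {name: s / se for name, (s, se) in groups.items()}
for name, value in root_sxx.items():
    print(f"{name}: root sum of squared x-deviations = {value:.2f}")

# HYPOTHETICAL what-if: males keep their s but get the females' x-spread.
s_males = groups["males"][0]
se_males_wider = s_males / root_sxx["females"]
print(f"males' SE of slope with the females' spread: {se_males_wider:.2f}")
```

Under this assumption the males' standard error would fall from about 10.2 to roughly 7.3, which illustrates why a wider range of lean body mass makes a nonzero slope easier to detect.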