Math 217 Homework 2 Solutions

 

11.7

  1. The country variable is an indicator variable (0 = U.K., 1 = U.S.). Hence, with all other variables held fixed, U.S. citizens are willing to pay 0.2304 fewer units (the problem doesn’t say whether these are dollars or pounds) for non-biotech cereal than U.K. citizens are.

 

  1. Let $\beta_{\text{country}}$ be the slope coefficient corresponding to the country indicator variable. Then we want to test $H_0: \beta_{\text{country}} = 0$ against $H_a: \beta_{\text{country}} \neq 0$. In this case, the null hypothesis says that U.S. citizens and U.K. citizens are willing to pay the same amount extra for non-biotech cereal (all other variables remaining the same).

 

We aren’t sure of the degrees of freedom (because the number of predictor variables isn’t given), but it’s probably very large (since n = 1810). That is, it’s well over 1000. From Table D (using df=1000), we can say the P-value is less than 2(.0005) = 0.001 (and it’s probably much less than this). Hence, country does have a significant impact on the willingness to pay more. (But is it practically significant? We can’t be sure without knowing the units on the response variable.)
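The Table D bound can be sanity-checked numerically. The sketch below rests on two assumptions that are not stated in the solution: that the df = 1000 row of the table lists t* = 3.300 for an upper-tail probability of 0.0005, and that with df this large the standard normal is a close enough stand-in for the t distribution.

```python
from statistics import NormalDist

# Assumed critical value: the df = 1000 row of a standard t table lists
# t* = 3.300 for an upper-tail probability of 0.0005 (not quoted in the solution).
T_STAR = 3.300

# With df well over 1000, the t distribution is very close to the standard
# normal, so the normal upper tail is used here as an approximation.
upper_tail = 1.0 - NormalDist().cdf(T_STAR)
two_sided_bound = 2.0 * upper_tail

print(f"upper tail beyond {T_STAR}: {upper_tail:.6f}")
print(f"two-sided bound on the P-value: {two_sided_bound:.6f}")
```

Any t statistic larger in absolute value than the assumed critical value gives a two-sided P-value below this bound.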

 

  1. The U.K. response rate is much lower than that for U.S. citizens. What about the 71.5% of the U.K. sample that didn’t respond? Do they share something in common that would affect the response variable? If so, the sample is biased. Hence, because of the low response rates, we should be leery of generalizing these results to the whole U.K. and U.S. populations.

 

The exclusion of people who answered “don’t know” is another place for potential bias. If a large percentage answered “don’t know,” then maybe the question was worded poorly or maybe the “don’t know” gives us important information about a certain question. I would like to know what percentage of the survey subjects answered “don’t know” (and if it’s high, then I’d wonder about potential bias in the results).

 

11.16

  1. The scatterplot is shown below. There seems to be a positive relationship between the two variables. The relationship looks slightly curved, but that’s mainly because of the two outlying companies (Charles Schwab and Fidelity).

 

 

  1. The regression output from fitting a straight line is shown below. The P-value for the test on the slope coefficient is 0.000. Hence, the number of Internet accounts has a significant impact on the total assets of a company. In particular, for each additional 1000 Internet accounts a company has, it is predicted to have an additional $83,200,000 in total assets. (And the R-squared value is very high.)

 

Regression Analysis: Total Assets (bi versus Internet Account

 

The regression equation is

Total Assets (billions of $) = - 17.1 + 0.0832 Internet Accounts (in 1000s)

 

Predictor                         Coef   SE Coef      T      P

Constant                       -17.121     8.778  -1.95  0.087

Internet Accounts (in 1000s)  0.083205  0.007592  10.96  0.000

 

S = 20.1877   R-Sq = 93.8%   R-Sq(adj) = 93.0%
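The dollar interpretation of the slope and the reported T value both follow directly from the output above. A small sketch of that arithmetic, using only the numbers printed by Minitab (the variable names are mine):

```python
# Values copied from the Minitab output above.
slope_billions_per_1000_accounts = 0.083205
slope_se = 0.007592

# The response is in billions of dollars, so multiply by 1e9 to get dollars.
slope_dollars_per_1000_accounts = slope_billions_per_1000_accounts * 1e9

# The t statistic for H0: slope = 0 is the estimate divided by its standard error.
t_slope = slope_billions_per_1000_accounts / slope_se

print(f"${slope_dollars_per_1000_accounts:,.0f} per additional 1000 accounts")
print(f"t = {t_slope:.2f}")
```

Rounding the slope to 0.0832 first gives the $83,200,000 quoted in the text.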

  1. The residual plot is shown below. The plot does not show a random scatter, but rather a pattern of curvature. This indicates we should fit a curve rather than a straight line. (Note: The residuals also do not look normal, but perhaps fixing the mis-specification will also help with the normality.)

 

 

  1. The output and residual plots from the quadratic regression are shown below. Note the coefficient for number of Internet accounts is no longer significant, but the coefficient for the square of the Internet accounts is (and it doesn’t make sense to include the squared term without the linear term).

 

The residual plot no longer shows a pattern of curvature (although the two outliers are clearly present). And the condition of normality seems plausible, based on the histogram and normality plot. Hence, it seems like the conditions of our regression model are met. Plus the R-squared value is now at 97.9%, indicating that our model explains 97.9% of the variation in total assets, which is very high.

 

Regression Analysis: Total Assets versus Internet Acc, Internet Acc

 

The regression equation is

Total Assets (billions of $) = 7.61 - 0.0046 Internet Accounts (in 1000s)

                               + 0.000034 Internet Accounts Squared

 

Predictor                           Coef     SE Coef      T      P

Constant                           7.608       8.503   0.89  0.401

Internet Accounts (in 1000s)    -0.00457     0.02378  -0.19  0.853

Internet Accounts Squared     0.00003361  0.00000893   3.76  0.007

 

S = 12.4117   R-Sq = 97.9%   R-Sq(adj) = 97.3%
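One way to read the quadratic fit is through its slope, which now depends on the number of accounts. The sketch below evaluates the fitted curve and its derivative using the coefficients printed above; the account counts are hypothetical values picked only for illustration, not data from the problem.

```python
# Coefficients copied from the quadratic regression output above.
B0, B1, B2 = 7.608, -0.00457, 0.00003361


def predicted_assets(accounts_thousands):
    """Fitted total assets (billions of $) for a number of accounts (in 1000s)."""
    return B0 + B1 * accounts_thousands + B2 * accounts_thousands ** 2


def marginal_slope(accounts_thousands):
    """Derivative of the fitted curve: billions of $ per additional 1000 accounts."""
    return B1 + 2.0 * B2 * accounts_thousands


# Hypothetical account counts (in 1000s), chosen only for illustration.
for x in (500.0, 1500.0, 2500.0):
    print(f"x = {x:6.0f}: fitted = {predicted_assets(x):7.2f}, "
          f"slope = {marginal_slope(x):.5f}")
```

Because the squared-term coefficient is positive, the predicted gain per additional 1000 accounts grows as the account count grows, which is the curvature the straight-line residual plot was pointing to.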

 

 

11.18

As I mentioned in the previous problem, the two outlying companies are Charles Schwab and Fidelity. After removing these two companies, the linear regression output is shown below. The residuals indicate that the model conditions are probably not met (slight fanning out in the residual plot, and the histogram and normality plot indicate deviations from normality). Hence, we should be leery of doing inference with this model. (If we feel comfortable doing inference, then we see that the Internet accounts variable is statistically significant at the 0.05 level, though only just: the P-value is 0.050.) The slope coefficient changed quite a bit without the outliers: now for each additional 1000 Internet accounts a company has, it is predicted to have an additional $29,700,000 in total assets. Note that the R-squared value is now only 50%, which is much lower than when the outliers were included.

 

 

Regression Analysis: Total Assets (bi versus Internet Account

 

The regression equation is

Total Assets (billions of $) = 2.13 + 0.0297 Internet Accounts (in 1000s)

 

Predictor                          Coef  SE Coef     T      P

Constant                          2.131    5.787  0.37  0.725

Internet Accounts (in 1000s)    0.02967  0.01211  2.45  0.050

 

S = 9.36903   R-Sq = 50.0%   R-Sq(adj) = 41.7%

 

 

 

 

Note that since the residual plot does not show any curvature, it doesn’t really make sense to fit a quadratic model. The problem asks for this, but it wouldn’t be a natural step in the data analysis.

 

Below is the output from the quadratic regression. (A look at the residuals indicates that normality seems plausible, as does constant variance, although there is a slight fanning out in the residual plot.) Notice that the F-test for the overall model is not significant (P-value = 0.093). Hence, we cannot reject the hypothesis that all slope coefficients are zero. This is then the end of our analysis. (This isn’t surprising given the note above about the lack of curvature in the residual plot.)

 

Regression Analysis: Total Assets versus Internet Acc, Internet Acc

The regression equation is

Total Assets (billions of $)_1 = - 6.74 + 0.0887 Internet Accounts (in 1000s)_1

                                 - 0.000063 Internet Accounts Squared_1

 

Predictor                              Coef     SE Coef      T      P

Constant                             -6.737       9.204  -0.73  0.497

Internet Accounts (in 1000s)_1      0.08874     0.05016   1.77  0.137

Internet Accounts Squared_1     -0.00006251  0.00005163  -1.21  0.280

 

S = 9.02527   R-Sq = 61.4%   R-Sq(adj) = 45.9%

 

Analysis of Variance

Source          DF       SS      MS     F      P

Regression       2   646.80  323.40  3.97  0.093

Residual Error   5   407.28   81.46

Total            7  1054.08
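The F statistic, R-squared, adjusted R-squared, and S reported above can all be recovered from the ANOVA table. A minimal sketch of those relationships, using only the table entries (variable names are mine):

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table above.
ss_regression, df_regression = 646.80, 2
ss_error, df_error = 407.28, 5
ss_total, df_total = 1054.08, 7

ms_regression = ss_regression / df_regression  # mean square for the model
ms_error = ss_error / df_error                 # mean square error

f_statistic = ms_regression / ms_error                   # overall F test
r_squared = ss_regression / ss_total                     # fraction of variation explained
adj_r_squared = 1.0 - ms_error / (ss_total / df_total)   # penalizes extra predictors
s = math.sqrt(ms_error)                                  # residual standard deviation

print(f"F = {f_statistic:.2f}")
print(f"R-Sq = {r_squared:.1%}, R-Sq(adj) = {adj_r_squared:.1%}")
print(f"S = {s:.3f}")
```

The gap between R-Sq and R-Sq(adj) is large here because only 8 companies remain after the two outliers are dropped, so two predictors use up a sizable share of the degrees of freedom.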

11.24

  1. Correlation only makes sense for quantitative variables. The correlations between the 3 quantitative variables are shown below.

 

Correlations: Grade Point Average, IQ Test Score, Self-Concept Score

 

                             Grade Point Avg     IQ Test Score

IQ Test Score                   0.634

Self-Concept Score              0.542             0.493

 

The straight-line relationship with IQ test score explains $r^2 = (0.634)^2 = 0.402$, or 40.2%, of the variation in GPA. The straight-line relationship with Self-Concept score explains $r^2 = (0.542)^2 = 0.294$, or 29.4%, of the variation in GPA.
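For a single predictor, the fraction of variation explained by a straight-line fit is the square of the correlation. A short sketch of that calculation from the correlations printed above (the dictionary names are mine):

```python
# Correlations with GPA, copied from the Minitab correlation output above.
correlations_with_gpa = {
    "IQ Test Score": 0.634,
    "Self-Concept Score": 0.542,
}

# For simple linear regression, R-squared equals the squared correlation.
r_squared = {name: r ** 2 for name, r in correlations_with_gpa.items()}

for name, value in r_squared.items():
    print(f"{name}: r^2 = {value:.3f} ({value:.1%} of the variation in GPA)")
```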

 

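  1. The model must include both IQ score and Self-Concept score: $\text{GPA}_i = \beta_0 + \beta_1\,\text{IQ}_i + \beta_2\,\text{SC}_i + \varepsilon_i$, for $i = 1, \dots, n$, where the $\varepsilon_i$ are independent $N(0, \sigma)$ random variables.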

 

  1. The regression output and residual plots are shown below. First notice that the regression conditions seem to be (at least slightly) violated. The residual plot shows a slight fanning out (as do the plots of residuals versus each predictor variable). Furthermore, the residuals show a slight right skewness. These aren’t major violations, but they should give us some pause when doing inference.

 

The regression model explains 47.1% of the variation in GPA.

 

Regression Analysis: Grade Point  versus IQ Test Scor, Self-Concept

 

The regression equation is

Grade Point Average = - 3.88 + 0.0772 IQ Test Score + 0.0513 Self-Concept Score

 

Predictor              Coef  SE Coef      T      P

Constant             -3.882    1.472  -2.64  0.010

IQ Test Score       0.07720  0.01539   5.02  0.000

Self-Concept Score  0.05125  0.01633   3.14  0.002

 

S = 1.54715   R-Sq = 47.1%   R-Sq(adj) = 45.7%

 

 

  1. The question of interest translates to testing $H_0: \beta_2 = 0$ against $H_a: \beta_2 \neq 0$, where $\beta_2$ is the coefficient on self-concept score in the model that also includes IQ score. For this test, the P-value is 0.002. That is, if self-concept score has no linear impact on GPA (in the presence of IQ score), then there is only a 0.002 chance of getting a sample coefficient at least as extreme as ours. So we have strong evidence that self-concept score has a statistically significant impact on GPA, even in the presence of IQ score. Is this result practically significant? Well, for each additional “point” on the self-concept score, the GPA (if IQ test score is held constant) is predicted to increase by 0.05. We’d need an expert’s opinion to assess the practical significance of this. Furthermore, I give these results with the caveat that the model conditions might not be met (see my comments in part c), so our inference might not be accurate.

 

 

 

  1. I used an indicator variable for gender where (0 = male) and (1 = female). The regression results for the 3-predictor model are shown below. Again notice that the model conditions seem slightly violated (there is the same slight fanning in the residual plot, plus there is an outlying residual that skews the distribution—perhaps this point should be removed and analysis rerun in order to assess its impact). All the coefficients are statistically significant—that is, they contribute significantly to predicting GPA. The R-squared value for this model (52.1%) is slightly higher than the previous model, but still not great.

 

Interpretation of the gender coefficient: For fixed IQ score and Self-Concept score, females are predicted to have a GPA 0.9685 higher than males.

 

Regression Analysis: Grade Point  versus IQ Test Scor, Self-Concept, ...

 

The regression equation is

Grade Point Average = - 5.02 + 0.0841 IQ Test Score + 0.0513 Self-Concept Score

                      + 0.969 Gender_F

 

Predictor              Coef  SE Coef      T      P

Constant             -5.022    1.470  -3.42  0.001

IQ Test Score       0.08412  0.01495   5.62  0.000

Self-Concept Score  0.05129  0.01565   3.28  0.002

Gender_F             0.9685   0.3495   2.77  0.007

 

S = 1.48253   R-Sq = 52.1%   R-Sq(adj) = 50.1%
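To see what the indicator coefficient does, the fitted equation can be evaluated for a male and a female student with the same IQ and self-concept scores. The scores below are hypothetical inputs picked only for illustration; the coefficients come from the output above.

```python
# Coefficients copied from the 3-predictor regression output above.
INTERCEPT = -5.022
B_IQ = 0.08412
B_SELF_CONCEPT = 0.05129
B_GENDER_F = 0.9685  # indicator: 0 = male, 1 = female


def predicted_gpa(iq, self_concept, is_female):
    """Fitted GPA from the 3-predictor model."""
    return (INTERCEPT + B_IQ * iq + B_SELF_CONCEPT * self_concept
            + B_GENDER_F * (1 if is_female else 0))


# Hypothetical student scores, not taken from the data set.
iq, self_concept = 110, 60

gpa_male = predicted_gpa(iq, self_concept, is_female=False)
gpa_female = predicted_gpa(iq, self_concept, is_female=True)

print(f"male:   {gpa_male:.4f}")
print(f"female: {gpa_female:.4f}")
print(f"difference: {gpa_female - gpa_male:.4f}")
```

Whatever scores are plugged in, the two predictions differ by exactly the gender coefficient, because the model has no interaction terms.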

 

 

11.26

  1. For VO+, graphics and numerical summaries are shown below. There is a longer right tail in the distribution of VO+ values. The median is 870 (no units given). The boxplot indicates an outlying VO+ value on the high end.

 

 

Descriptive Statistics: VOPlus (Bone Formation Measure)

Variable     N  Mean  StDev  Minimum   Q1  Median    Q3  Maximum   IQR

VOPlus      31   986    580      285  513     870  1251     2545   738

 

 

 

 

 

 

For VO-, graphics and numerical summaries are shown below. The distribution is somewhat hard to characterize. Most of the women have VO- values in the 500-1000 range, and then there’s a small right tail. The median is 903 (no units given). The boxplot indicates an outlying VO- value on the high end.

 

 

Descriptive Statistics: VOMinus (Bone Resorption Meas.)

Variable    N   Mean  StDev  Minimum     Q1  Median      Q3   Maximum    IQR

VOMinus    31  889.2  427.6    254.0  536.0   903.0  1028.0    2236.0  492.0

 

 

 

For Osteocalcin, graphics and numerical summaries are shown below. The distribution of values is skewed to the right, with a median of 30.20 mg/ml. The boxplot indicates no outlying points.

 

 

Descriptive Statistics: Osteocalcin (mg/ml) - Biomarker

Variable      N   Mean  StDev  Minimum     Q1  Median     Q3  Maximum    IQR

Osteocalcin  31  33.42  19.61     8.10  17.90   30.20  47.70    77.90  29.80

 

 

 

For Tartrate Resistant Acid Phosphatase (TRAP), graphics and numerical summaries are shown below. The distribution of values is skewed to the right, with a median of 10.30 units per liter. The boxplot indicates no outlying points.

   

 

Descriptive Statistics: TRAP (U/l) - Biomarker

Variable     N   Mean  StDev  Minimum    Q1  Median     Q3  Maximum    IQR

TRAP        31  13.25   6.53     3.30  8.80   10.30  19.00    28.80  10.20
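The boxplot outlier calls for the four variables can be checked against the quartiles above. The sketch assumes the boxplots flag points more than 1.5 × IQR beyond the quartiles (the usual convention; the solution does not say which rule was used), and it can only test the minimum and maximum, since the individual observations are not listed.

```python
# Summary values copied from the Minitab descriptive statistics above.
summaries = {
    "VOPlus":      {"min": 285.0, "q1": 513.0, "q3": 1251.0, "max": 2545.0},
    "VOMinus":     {"min": 254.0, "q1": 536.0, "q3": 1028.0, "max": 2236.0},
    "Osteocalcin": {"min": 8.10,  "q1": 17.90, "q3": 47.70,  "max": 77.90},
    "TRAP":        {"min": 3.30,  "q1": 8.80,  "q3": 19.00,  "max": 28.80},
}


def fences(q1, q3):
    """Lower and upper fences under the assumed 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


has_high_outlier = {}
has_low_outlier = {}
for name, stats in summaries.items():
    lower, upper = fences(stats["q1"], stats["q3"])
    has_high_outlier[name] = stats["max"] > upper
    has_low_outlier[name] = stats["min"] < lower
    print(f"{name}: fences = ({lower:.1f}, {upper:.1f}), "
          f"high outlier: {has_high_outlier[name]}, "
          f"low outlier: {has_low_outlier[name]}")
```

Under that assumed rule, the maxima of VO+ and VO- fall above the upper fence while those of Osteocalcin and TRAP do not, which agrees with the boxplot descriptions.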

 

  1. To descriptively summarize the relationships between the variables we can look at scatterplots and correlations (shown below).  All the relationships are positive and at least somewhat linear. The strongest linear relationship is between VO+ and VO-. TRAP also has a strong linear relationship with both VO+ and Osteocalcin.

 

 

Correlations: VOPlus, VOMinus, Osteocalcin, and TRAP

             VOPlus  VOMinus  Osteocalcin

VOMinus       0.898

Osteocalcin   0.647    0.455

TRAP          0.754    0.678        0.730

 

           

11.27

  1. The regression output (from the regression of VO+ on OC) is shown below. The residuals appear quite non-normal, indicating a violation in one of our model conditions (the constant-variance condition seems plausible). Hence, even though OC is shown to have a significant (P-value = 0.000) linear impact on VO+, this inference might be inaccurate. Descriptively, we see that 41.9% of the variation in VO+ is explained by the regression model. Also, for each additional mg/ml of OC, the predicted VO+ increases by 19.1 units.

 

Regression Analysis: VOPlus (Bone For versus Osteocalcin (mg/

 

The regression equation is

VOPlus (Bone Formation Measure) = 346 + 19.1 Osteocalcin (mg/ml) - Biomarker

 

Predictor                          Coef  SE Coef     T      P

Constant                          346.2    161.5  2.14  0.041

Osteocalcin (mg/ml) - Biomarker  19.142    4.185  4.57  0.000

 

S = 449.527   R-Sq = 41.9%   R-Sq(adj) = 39.9%

 

 

 

  1. The regression output (from the regression of VO+ on both OC and TRAP) is shown below. The residual versus fits graph shows a slight fanning out (as do both graphs of residuals versus the predictor variables—not shown here). This isn’t a horrible violation of the constant-variance condition. But it does appear (based on the histogram and normal plot of the residuals) that the normality condition is not met. Hence, the inference I discuss next might not be accurate.

 

The overall F test (P-value = 0.000) indicates that at least one of the population regression coefficients is not zero. Looking at the individual coefficients, the one associated with OC does not have a significant (P-value = 0.25) impact on VO+, in the presence of TRAP; the one associated with TRAP does have a significant (P-value = 0.002) impact on VO+, even in the presence of OC. This is not surprising, given that OC and TRAP are correlated, yet TRAP is more strongly correlated to VO+ than OC is. [Again, it’s important to realize that this inference might not be accurate, since the normality condition of our model is not met.]

 

Descriptively, we see that 58.8% of the variation in VO+ is explained by the regression model. Also, for each additional mg/ml of OC, the predicted VO+ increases by 6.157 units (assuming TRAP is held constant); for each additional unit/liter of TRAP, the predicted VO+ increases by 53.44 units (assuming OC is held constant).

 

Regression Analysis: VOPlus versus Osteocalcin , TRAP  

 

The regression equation is

VOPlus (Bone Formation Measure) = 72 + 6.16 Osteocalcin (mg/ml) - Biomarker

                                  + 53.4 TRAP (U/l) - Biomarker

 

Predictor                         Coef  SE Coef     T      P

Constant                          72.1    160.2  0.45  0.656

Osteocalcin (mg/ml) - Biomarker  6.157    5.246  1.17  0.250

TRAP (U/l) - Biomarker           53.44    15.76  3.39  0.002

 

S = 385.162   R-Sq = 58.8%   R-Sq(adj) = 55.9%

 

Analysis of Variance

Source          DF        SS       MS      F      P

Regression       2   5933269  2966635  20.00  0.000

Residual Error  28   4153791   148350

Total           30  10087061

 

 

 

 

11.28

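  1. The statistical model is $\text{VO+}_i = \beta_0 + \beta_1\,\text{OC}_i + \beta_2\,\text{TRAP}_i + \beta_3\,\text{VO-}_i + \varepsilon_i$ for $i = 1, \dots, 31$, where the $\varepsilon_i$ are independent $N(0, \sigma)$ random variables.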

 

 

 

 

 

  1. The output from the regression of VO+ on OC, TRAP, and VO- is shown below. Based on the residuals, the constant-variance condition seems to be met, and the normality condition seems plausible. Hence, we trust the results from our inference. (Also, although not presented with the output, the separate plots of residuals versus each predictor variable show no apparent patterns.)

 

The overall F test (P-value = 0.000) indicates that at least one of the population regression coefficients is not zero. Looking at the individual coefficients, the one associated with OC does have a significant (P-value = 0.010) impact on VO+, even in the presence of TRAP and VO-; the one associated with TRAP does not have a significant (P-value = 0.637) impact on VO+, in the presence of OC and VO-; the one associated with VO- does have a significant (P-value = 0.000) impact on VO+, even in the presence of OC and TRAP.

 

Descriptively, we see that 87.9% (which is quite high) of the variation in VO+ is explained by the regression model. Also, for each additional mg/ml of OC, the predicted VO+ increases by 8.021 units (assuming TRAP and VO- are held constant); for each additional unit/liter of TRAP, the predicted VO+ increases by 5.04 units (assuming OC and VO- are held constant); for each additional unit of VO-, the predicted VO+ increases by 0.9979 units (assuming OC and TRAP are held constant).

 

Regression Analysis: VOPlus (Bone versus Osteocalcin , TRAP (U/l) -, ...

 

The regression equation is

VOPlus (Bone Formation Measure) = - 236 + 8.02 Osteocalcin (mg/ml)

                                        + 5.0 TRAP (U/l) + 0.998 VOMinus (Bone Resorption Meas.)

 

Predictor                           Coef  SE Coef      T      P

Constant                         -236.36    96.37  -2.45  0.021

Osteocalcin (mg/ml) - Biomarker    8.021    2.904   2.76  0.010

TRAP (U/l) - Biomarker              5.04    10.57   0.48  0.637

VOMinus (Bone Resorption Meas.)   0.9979   0.1239   8.06  0.000

 

S = 212.580   R-Sq = 87.9%   R-Sq(adj) = 86.6%

 

Analysis of Variance

Source          DF        SS       MS      F      P

Regression       3   8866925  2955642  65.40  0.000

Residual Error  27   1220136    45190

Total           30  10087061

 

 

  1.  Model 1 (only OC as a predictor):

 

Predictor                          Coef  SE Coef     T      P

Constant                          346.2    161.5  2.14  0.041

Osteocalcin (mg/ml) - Biomarker  19.142    4.185  4.57  0.000

 

Model 2 (both OC and TRAP as predictors):

Predictor                          Coef  SE Coef     T      P

Constant                           72.1    160.2  0.45  0.656

Osteocalcin (mg/ml) - Biomarker   6.157    5.246  1.17  0.250

TRAP (U/l) - Biomarker            53.44    15.76  3.39  0.002

 

      Model 3 (OC, TRAP, and VO- as predictors):

Predictor                           Coef  SE Coef      T      P

Constant                         -236.36    96.37  -2.45  0.021

Osteocalcin (mg/ml) - Biomarker    8.021    2.904   2.76  0.010

TRAP (U/l) - Biomarker              5.04    10.57   0.48  0.637

VOMinus (Bone Resorption Meas.)   0.9979   0.1239   8.06  0.000

The estimated coefficient on OC changes greatly from the first model (when it’s the only predictor) compared to the second and third models. Furthermore, the significance of the OC coefficient changes between models (significant in the first, non-significant in the second, and significant again in the third). The estimated coefficient on TRAP also changes greatly between models 2 and 3 (from 53.44 to 5.04), going from significant (P-value = 0.002) to non-significant (P-value = 0.637) once VO- is added.

 

  1. Model 1 (only OC as a predictor): S = 449.527   R-Sq = 41.9%

Model 2 (both OC and TRAP as predictors):  S = 385.162   R-Sq = 58.8%

Model 3 (OC, TRAP, and VO- as predictors):  S = 212.580   R-Sq = 87.9%

 

As the number of predictor variables increases, the percentage of variation explained (R-squared) increases, and the standard deviation of the residuals, s, decreases. (These are both good things from a model standpoint, although R-squared never decreases when a predictor variable is added, even if the variable doesn’t have a significant impact on the response.)

 

  1. The model run in part b showed that OC and VO- were significant, but TRAP wasn’t. Hence, we can run a model with only OC and VO- as predictor variables. The output from this regression is shown below. Based on the residuals, the constant-variance condition seems to be met, and the normality condition seems plausible. Hence, we trust the results from our inference. (Also, although not presented with the output, the separate plots of residuals versus each predictor variable show no apparent patterns.)

 

The overall F test (P-value = 0.000) indicates that at least one of the population regression coefficients is not zero. Looking at the individual coefficients, the one associated with OC does have a significant (P-value = 0.000) impact on VO+, even in the presence of VO-; the one associated with VO- does have a significant (P-value = 0.000) impact on VO+, even in the presence of OC.

 

Descriptively, we see that 87.8% (which is quite high) of the variation in VO+ is explained by the regression model. Also, for each additional mg/ml of OC, the predicted VO+ increases by 8.912 units (assuming VO- stays constant); for each additional unit of VO-, the predicted VO+ increases by 1.0315 units (assuming OC stays constant).

 

This is clearly the best model of the four we ran. It has the highest adjusted R-squared value (which adjusts for the number of predictor variables) and all the coefficients are statistically significant (which we can trust, because the model conditions seem to be met).

 

Regression Analysis: VOPlus versus Osteocalcin , VOMinus

 

The regression equation is

VOPlus (Bone Formation Measure) = - 229 + 8.91 Osteocalcin (mg/ml) + 1.03 VOMinus

 

Predictor                           Coef  SE Coef      T      P

Constant                         -229.23    93.88  -2.44  0.021

Osteocalcin (mg/ml) - Biomarker    8.912    2.191   4.07  0.000

VOMinus (Bone Resorption Meas.)   1.0315   0.1005  10.26  0.000

 

S = 209.626   R-Sq = 87.8%   R-Sq(adj) = 86.9%

 

Analysis of Variance

Source          DF        SS       MS       F      P

Regression       2   8856651  4428325  100.77  0.000

Residual Error  28   1230410    43943

Total           30  10087061
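The adjusted R-squared comparison across the four models can be reproduced from the reported R-squared values, the number of predictors in each model, and n = 31. A minimal sketch (the model labels and function name are mine; inputs are the rounded R-Sq values from the output, so results match Minitab only to the printed precision):

```python
N_OBSERVATIONS = 31  # from the descriptive statistics (N = 31) and Total DF = 30

# Reported R-squared (as a fraction) and number of predictors for each model.
models = {
    "OC": (0.419, 1),
    "OC + TRAP": (0.588, 2),
    "OC + TRAP + VO-": (0.879, 3),
    "OC + VO-": (0.878, 2),
}


def adjusted_r_squared(r_squared, n_predictors, n_observations):
    """Adjusted R-squared: 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_observations - 1) / (
        n_observations - n_predictors - 1)


adjusted = {
    name: adjusted_r_squared(r2, p, N_OBSERVATIONS)
    for name, (r2, p) in models.items()
}
best_model = max(adjusted, key=adjusted.get)

for name, value in adjusted.items():
    print(f"{name}: R-Sq(adj) = {value:.1%}")
print(f"highest adjusted R-squared: {best_model}")
```

Dropping TRAP lowers R-squared by only 0.1 percentage point but frees a degree of freedom, which is why the two-predictor model comes out ahead on the adjusted measure.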