You have just fit a regression model and started assessing the residuals to determine whether they meet two of the assumptions of linear regression: that they are normally distributed and have constant error variance. Much to your dismay, you find that your residuals have a NON-constant error variance. In other words, your residual plot looks similar to one of the ones shown below. Oh no!
Why is this a problem? First, the standard errors for your predictor coefficients may now be incorrect. Second, your t and F statistics may no longer follow their proper distributions. In other words, you won’t be able to properly assess statistical significance (Zuur et al., 2009). Finally, any data you simulate from your model will be incorrect.
Where Does this Assumption Come From?
Below is the typical regression model:
Yi = β0 + β1X1i + β2X2i + … + βpXpi + εi
εi ~ N(0, σ2)
The homogeneity of variance assumption is represented by εi ~ N(0, σ2). Basically, this says that each residual should come from a normal distribution with a mean of 0 and the same variance, σ2. If this is true, then we should see a constant residual spread across the fitted values. If it is not true, then the residual spread will differ across the fitted values. For example, we could see something like what is shown in the two plots above.
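A quick simulation makes this concrete. The sketch below (hypothetical data; the coefficients and sample size are arbitrary) generates one dataset that satisfies the assumption and one that violates it, then plots the residuals from an ordinary lm() fit for each:

```r
# Minimal simulation (hypothetical data) showing what a violated
# homogeneity-of-variance assumption looks like in a residual plot.
set.seed(42)
n <- 500
x <- runif(n, 1, 10)

# Constant variance: every residual drawn from N(0, 2^2)
y_const <- 5 + 2 * x + rnorm(n, sd = 2)

# Non-constant variance: residual spread grows with x
y_fan <- 5 + 2 * x + rnorm(n, sd = 2 * x)

fit_const <- lm(y_const ~ x)
fit_fan   <- lm(y_fan ~ x)

# The first plot shows an even band around zero; the second
# fans out as the fitted values increase.
par(mfrow = c(1, 2))
plot(fitted(fit_const), resid(fit_const), main = "Constant variance")
plot(fitted(fit_fan), resid(fit_fan), main = "Non-constant variance")
```

The "fan" shape in the second plot is the classic visual signature of a non-constant error variance.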
A Simple Solution
A simple solution would be to change your model so that it no longer assumes homogeneity of variance. This may sound difficult to do, but it’s actually very easy (at least in R). To do it, you’ll need to use the gls() function from the nlme library.
If your residual variance increases as a function of one of your predictors (like what’s shown in the second panel above), you could change the εi ~ N(0, σ2) in your model to εi ~ N(0, σ2(Xi)), where Xi is the predictor that the residual variance changes with. In other words, the residual variance is now modeled as proportional to that predictor.
For example, if you have a regression model with a single continuous predictor and your residual variance increases as values of your predictor increase, then you could run the following R code to calculate a model with the changed error term:
library(nlme)
fit1 <- gls(y ~ x, data = contData, weights = varFixed(~x))
summary(fit1)

Generalized least squares fit by REML
  Model: y ~ x
  Data: contData
       AIC      BIC    logLik
  9515.935 9530.652 -4754.967

Variance function:
 Structure: fixed weights
 Formula: ~x

Coefficients:
               Value Std.Error  t-value p-value
(Intercept) 98.99807 1.3284134  74.5235       0
x            1.00176 0.0026904 372.3427       0

 Correlation:
  (Intr)
x -0.822

Standardized residuals:
        Min          Q1         Med          Q3         Max
-3.33458441 -0.64276471 -0.04711188  0.64779512  3.67290659

Residual standard error: 1.187176
Degrees of freedom: 1000 total; 998 residual
How do you interpret this output? Well, you interpret it just like you would if it were output from a typical regression. Easy!
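If you are unsure whether the variance structure actually improves the model, you can fit the same fixed effects with and without the weights and compare them by AIC (varFixed() estimates no extra parameters, so this is a fair comparison). The data below are simulated purely for illustration; the name contData and the coefficient values are assumptions, not taken from the original analysis:

```r
library(nlme)  # gls() and varFixed() live here

# Hypothetical data where the residual variance grows with x
set.seed(1)
contData <- data.frame(x = runif(1000, 1, 100))
contData$y <- 100 + contData$x + rnorm(1000, sd = sqrt(contData$x))

fit0 <- gls(y ~ x, data = contData)                          # constant variance
fit1 <- gls(y ~ x, data = contData, weights = varFixed(~x))  # variance proportional to x

# Lower AIC favors the model whose variance assumption better fits the data
AIC(fit0, fit1)
```

Because the simulated residual variance really is proportional to x here, the weighted fit should come out ahead.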
If your residual variance changes as a function of a categorical predictor (like what’s shown in the first panel above), you could include this in your model as well by changing εi ~ N(0, σ2) to εi ~ N(0, σ2j), where j represents each level of the categorical predictor. Basically, we’re giving each level of our categorical predictor its own residual variance.
Consider a regression model with a single categorical predictor (group) with 2 levels (Group 1 and Group 2). If the residual spread differs across levels of the predictor (the two groups), then you could run the following R code to calculate a model with the modified error term:
fit2 <- gls(y ~ group, data = catData, weights = varIdent(form = ~ 1 | group))
summary(fit2)

Generalized least squares fit by REML
  Model: y ~ group
  Data: catData
       AIC      BIC    logLik
  13627.53 13647.15 -6809.764

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | group
 Parameter estimates:
 Group 1  Group 2
1.000000 4.953512

Coefficients:
                Value Std.Error   t-value p-value
(Intercept)  100.4819  4.441059 22.625669       0
groupGroup 2 197.9600 22.442637  8.820712       0

 Correlation:
             (Intr)
groupGroup 2 -0.198

Standardized residuals:
         Min           Q1          Med           Q3          Max
-2.926579282 -0.666993810  0.003494385  0.639112804  2.891051620

Residual standard error: 99.30511
Degrees of freedom: 1000 total; 998 residual
Once again, you can interpret this output like you would with any normal regression. The only new part is under “Variance function,” which lists a parameter estimate for each level of our categorical predictor: 1 and 4.95. These are ratios of residual standard deviations, with the first group serving as the reference (fixed at 1). When you multiply these values by the residual standard error (99.31), you get the estimated residual standard deviation for each group (Group 1 = 99.31 × 1 = 99.31; Group 2 = 99.31 × 4.95 ≈ 491.59).
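You can also pull these quantities out of the fitted object programmatically, rather than reading them off the printed summary. The sketch below assumes nlme's documented accessors (the sigma component of a gls fit and coef() on its variance-function structure); the two-group data are simulated for illustration and the name catData is an assumption:

```r
library(nlme)

# Hypothetical two-group data with different residual spreads
set.seed(7)
catData <- data.frame(group = rep(c("Group 1", "Group 2"), each = 500))
catData$y <- ifelse(catData$group == "Group 1",
                    100 + rnorm(500, sd = 10),
                    300 + rnorm(500, sd = 50))

fit2 <- gls(y ~ group, data = catData,
            weights = varIdent(form = ~ 1 | group))

# Residual standard error (the SD for the reference stratum)
sigma_hat <- fit2$sigma

# SD ratios per stratum, with the reference level fixed at 1
ratios <- coef(fit2$modelStruct$varStruct,
               unconstrained = FALSE, allCoef = TRUE)

# Estimated residual standard deviation for each group
sigma_hat * ratios
```

With the simulated standard deviations of 10 and 50, the two estimates should land near those values.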
If you want to learn more about modeling non-constant variance structures, then I highly recommend Zuur et al. (2009), which can be purchased on Amazon. Although the book mentions ecology in the title, you don’t actually need to know anything about ecology to understand its material.
- Zuur, A., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed effects models and extensions in ecology with R. New York, NY: Springer.