Category Archives: Regression

A Simple Way of Dealing with Violations of the Homogeneity of Variance Assumption

The Problem

You just fit a regression model and have started assessing the residuals to determine whether they meet two of the assumptions of linear regression; that is, whether they're normally distributed and have a constant error variance. Much to your dismay, you find that your residuals have a NON-constant error variance. In other words, your residual plot looks similar to one of the ones shown below. Oh no!

[Figure: two residual plots showing non-constant error variance. In the first panel, the residual spread differs across the two levels of a categorical predictor; in the second panel, the residual spread increases as the values of a continuous predictor increase.]

Why is this a problem? First, the standard errors for your predictor coefficients may now be incorrect. Second, your t and F statistics may no longer follow their proper distributions. In other words, you won’t be able to properly assess statistical significance (Zuur et al., 2009). Finally, any data you simulate from your model will be incorrect.

Where Does this Assumption Come From?

Below is the typical regression model:

Yi = β0 + β1X1i + β2X2i + … + βpXpi + εi
εi ~ N(0, σ2)

The homogeneity of variance assumption is represented by εi ~ N(0, σ2). Basically, this is saying that each residual should come from a distribution with a mean of 0 and a variance of σ2. If this is true, then we should see a constant residual spread across the fitted values. If this is not true, then we would see differences in the residual spread across the fitted values. For example, we could see something like what is shown in the two plots above.
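If you want to make this kind of residual plot for your own model, here is a minimal sketch (assuming you already have a fitted model object named fit, e.g., from lm() or gls()):

plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")   # residuals vs. fitted values
abline(h = 0, lty = 2)                                                      # reference line at zero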

A Simple Solution

A simple solution would be to change your model so that it no longer assumes homogeneity of variance. This may sound difficult to do, but it’s actually very easy (at least in R). To do it, you’ll need to use the gls() function from the nlme library.
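The package usually comes with a standard R installation, but it does need to be loaded before gls() is available:

library(nlme)   # provides gls() and the variance functions used below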

If your residual variance increases as a function of one of your predictors (like what’s shown in the second panel above), you could change the εi ~ N(0, σ2) in your model to εi ~ N(0, σ2Xi), where Xi is the predictor that the residual variance changes with.

For example, if you have a regression model with a single continuous predictor and your residual variance increases as values of your predictor increase, then you could run the following R code to calculate a model with the changed error term:

fit1 <- gls(y ~ x, data = contData, weights = varFixed(~x))

> summary(fit1)
Generalized least squares fit by REML
  Model: y ~ x 
  Data: contData 
       AIC      BIC    logLik
  9515.935 9530.652 -4754.967

Variance function:
 Structure: fixed weights
 Formula: ~x 

Coefficients:
               Value Std.Error  t-value p-value
(Intercept) 98.99807 1.3284134  74.5235       0
x            1.00176 0.0026904 372.3427       0

 Correlation: 
  (Intr)
x -0.822

Standardized residuals:
        Min          Q1         Med          Q3         Max 
-3.33458441 -0.64276471 -0.04711188  0.64779512  3.67290659 

Residual standard error: 1.187176 
Degrees of freedom: 1000 total; 998 residual

How do you interpret this output? Well, you interpret it just like you would if it were output from a typical regression. Easy!
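If you want to check whether the variance structure is actually worth including, one common approach (discussed in Zuur et al., 2009) is to fit the same model with and without the weights argument and compare the two fits. A small sketch, reusing the contData example above:

fit0 <- gls(y ~ x, data = contData)                            # constant-variance model
fit1 <- gls(y ~ x, data = contData, weights = varFixed(~x))    # variance increasing with x
anova(fit0, fit1)                                              # compares the fits via AIC/BIC and a likelihood-ratio test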

If your residual variance changes as a function of a categorical predictor (like what’s shown in the first panel above), you could include this in your model as well by changing εi ~ N(0, σ2) to εi ~ N(0, σ2j), where j represents each level of the categorical predictor. Basically, we’re giving each level of our categorical predictor its own residual variance.

Consider a regression model with a single categorical predictor (group) with 2 levels (Group 1 and Group 2). If the residual spread differs across levels of the predictor (the two groups), then you could run the following R code to calculate a model with the modified error term:

fit2 <- gls(y ~ group, data = catData, weights = varIdent(form = ~1 | group))

> summary(fit2)
Generalized least squares fit by REML
  Model: y ~ group 
  Data: catData 
       AIC      BIC    logLik
  13627.53 13647.15 -6809.764

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | group 
 Parameter estimates:
 Group 1  Group 2 
1.000000 4.953512 

Coefficients:
                Value Std.Error   t-value p-value
(Intercept)  100.4819  4.441059 22.625669       0
groupGroup 2 197.9600 22.442637  8.820712       0

 Correlation: 
             (Intr)
groupGroup 2 -0.198

Standardized residuals:
         Min           Q1          Med           Q3          Max 
-2.926579282 -0.666993810  0.003494385  0.639112804  2.891051620 

Residual standard error: 99.30511 
Degrees of freedom: 1000 total; 998 residual

Once again, you can interpret this output like you would with any normal regression. The only thing that is new is under “Variance function,” which has a number below it for each level of our categorical predictor: 1 and 4.95. When you multiply these values by the residual standard error (99.31), you’ll get the estimated residual standard deviation for each of your groups (Group 1 = 99.31*1 = 99.31; Group 2 = 99.31*4.95 = 491.59). Squaring these gives the estimated residual variance for each group.
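If you’d rather have R do this arithmetic, here is a quick sketch using the fit2 object and the values from the output above:

sig <- fit2$sigma            # residual standard error (about 99.31)
sig * c(1, 4.953512)         # estimated residual standard deviation for each group
(sig * c(1, 4.953512))^2     # squared to get each group's residual variance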

If you want to learn more about modeling non-constant variance structures, then I highly recommend Zuur et al. (2009), which can be purchased on Amazon. Although the book mentions ecology in the title, you don’t actually need to know anything about ecology to understand its material.

Reference

  1. Zuur, A., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed effects models and extensions in ecology with R. New York, NY: Springer.

Simulating Data for a Regression Model in R

Simulating data for use in a linear regression model is fairly straightforward in R because of how the language’s vector operations work.

For instance, to simulate 1000 observations for the following population model:

Yi = 5 + 0.03X1i + 0.05X2i + εi
εi ~ N(0, σ2), where σ = 0.25

We could run the following R code:

n  <- 1000                                                    # number of observations
x1 <- rnorm(n)                                                # first predictor
x2 <- rnorm(n)                                                # second predictor
y  <- 5 + 0.03*x1 + 0.05*x2 + rnorm(n, mean = 0, sd = 0.25)   # outcome: the population model plus random error

Notice that to calculate y, we only had to enter our regression model. If we fit a regression to our simulated data, we should see something similar to the output below.
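A minimal way to fit and summarize the model (assuming the vectors simulated above) is:

fit <- lm(y ~ x1 + x2)   # fit the regression to the simulated data
summary(fit)             # print the summary shown below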

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.71103 -0.17190 -0.00386  0.17086  0.84411 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.010316   0.007763 645.370  < 2e-16 ***
x1          0.038880   0.007751   5.016 6.24e-07 ***
x2          0.049713   0.007548   6.586 7.28e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2455 on 997 degrees of freedom
Multiple R-squared:  0.06377,	Adjusted R-squared:  0.06189 
F-statistic: 33.95 on 2 and 997 DF,  p-value: 5.436e-15

Notice that the coefficients and the residual standard error are close to those in our population model.
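If you want to check this directly, here is a quick sketch using the fit object from above:

coef(fit)            # estimated coefficients (true values: 5, 0.03, and 0.05)
summary(fit)$sigma   # estimated residual standard deviation (true value: 0.25)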


© AcuPsy 2015       info@acupsy.com