Author Archives: Cary Stothart

A Simple Way of Dealing with Violations of the Homogeneity of Variance Assumption

The Problem

You just fit a regression model and have started assessing the residuals to determine whether they meet two of the assumptions of linear regression: that they are normally distributed and that they have a constant error variance. Much to your dismay, you find that your residuals have a NON-constant error variance. In other words, your residual plot looks like one of the plots shown below. Oh no!

[Figure: two residual-vs-fitted plots showing non-constant error variance]

Why is this a problem? First, the standard errors of your predictor coefficients may now be incorrect. Second, your t and F statistics may no longer follow their assumed distributions. In other words, you won’t be able to properly assess statistical significance (Zuur et al., 2009). Finally, any data you simulate from your model will misrepresent the variability in the population.
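To make the first of these points concrete, here is a small simulation sketch (in Python with only the standard library, rather than the post’s R, and with made-up parameter values): when the residual standard deviation grows with the predictor, the usual constant-variance standard-error formula can understate how much the slope estimate actually varies from sample to sample.

```python
import math
import random
import statistics

random.seed(2)

def ols_slope_and_se(x, y):
    """Simple-regression slope and its textbook (constant-variance) SE."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b1, math.sqrt(rss / (n - 2) / sxx)

n, reps = 50, 2000
x = [10 * i / n for i in range(1, n + 1)]  # fixed design on (0, 10]

slopes, ses = [], []
for _ in range(reps):
    # True model: y = 1 + 2x + eps, with residual SD growing sharply with x.
    y = [1 + 2 * xi + random.gauss(0, 0.2 * xi ** 2) for xi in x]
    b1, se = ols_slope_and_se(x, y)
    slopes.append(b1)
    ses.append(se)

empirical_sd = statistics.stdev(slopes)  # how much the slope really varies
avg_model_se = statistics.mean(ses)      # what the usual formula reports
print(empirical_sd, avg_model_se)        # the model SE understates the truth here
```

In this setup the empirical spread of the slope estimates comes out noticeably larger than the average standard error the constant-variance formula reports, which is exactly the miscalibration described above.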

Where Does this Assumption Come From?

Below is the typical regression model:

Yi = β0 + β1X1i + β2X2i + … + βpXpi + εi
εi ~ N(0, σ2)

The homogeneity of variance assumption is represented by εi ~ N(0, σ2). Basically, this is saying that each residual should come from a distribution with a mean of 0 and a variance of σ2. If this is true, then we should see a constant residual spread across the fitted values. If this is not true, then we would see differences in the residual spread across the fitted values. For example, we could see something like what is shown in the two plots above.
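As a quick illustration of what the assumption does and does not allow, here is a sketch in Python (standard library only; the numbers are made up for the example): we draw one set of residuals with a constant standard deviation and one whose standard deviation grows with a predictor, then compare the residual spread in the lower and upper halves of the predictor’s range.

```python
import random
import statistics

random.seed(1)

n = 2000
x = [random.uniform(0, 10) for _ in range(n)]

# Homogeneous residuals: every epsilon_i drawn from N(0, sigma^2) with sigma = 1.
homo = [random.gauss(0, 1) for _ in range(n)]

# Heterogeneous residuals: the standard deviation grows with x_i.
hetero = [random.gauss(0, 0.2 * xi) for xi in x]

def spread_by_half(x, resid):
    """Return the residual SD for the lower and upper halves of x."""
    paired = sorted(zip(x, resid))
    half = len(paired) // 2
    low = [r for _, r in paired[:half]]
    high = [r for _, r in paired[half:]]
    return statistics.stdev(low), statistics.stdev(high)

lo_h, hi_h = spread_by_half(x, homo)
lo_x, hi_x = spread_by_half(x, hetero)
print(lo_h, hi_h)  # roughly equal spreads
print(lo_x, hi_x)  # the upper half is clearly wider
```

The first pair of numbers is what the assumption requires; the second pair is the kind of pattern the plots above show.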

A Simple Solution

A simple solution would be to change your model so that it no longer assumes homogeneity of variance. This may sound difficult to do, but it’s actually very easy (at least in R). To do it, you’ll need to use the gls() function from the nlme library.

If your residual variance increases as a function of one of your predictors (like what’s shown in the second panel above), you could change the εi ~ N(0, σ2) in your model to εi ~ N(0, σ2Xi), where X is the predictor that the residual variance increases with.

For example, if you have a regression model with a single continuous predictor and your residual variance increases as values of your predictor increase, then you could run the following R code to calculate a model with the changed error term:

library(nlme)
fit1 <- gls(y~x, data=contData, weights=varFixed(~x))

> summary(fit1)
Generalized least squares fit by REML
  Model: y ~ x 
  Data: contData 
       AIC      BIC    logLik
  9515.935 9530.652 -4754.967

Variance function:
 Structure: fixed weights
 Formula: ~x 

Coefficients:
               Value Std.Error  t-value p-value
(Intercept) 98.99807 1.3284134  74.5235       0
x            1.00176 0.0026904 372.3427       0

 Correlation: 
  (Intr)
x -0.822

Standardized residuals:
        Min          Q1         Med          Q3         Max 
-3.33458441 -0.64276471 -0.04711188  0.64779512  3.67290659 

Residual standard error: 1.187176 
Degrees of freedom: 1000 total; 998 residual

How do you interpret this output? Well, you interpret it just like you would if it were output from a typical regression. Easy!

If your residual variance changes as a function of a categorical predictor (like what’s shown in the first panel above), you could include this in your model as well by changing εi ~ N(0, σ2) to εi ~ N(0, σ2j), where j represents each level of the categorical predictor. Basically, we’re giving each level of our categorical predictor its own residual variance.
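The idea behind this per-group error term can be sketched outside of R as well (a Python sketch, standard library only, with hypothetical group means and spreads): estimate each group’s residual spread separately and express one as a ratio of the other, which is the same kind of parameterization that appears in the varIdent output below.

```python
import random
import statistics

random.seed(3)

# Hypothetical two-group data: Group 2's residual spread is several
# times Group 1's, mirroring the situation described in the text.
group1 = [random.gauss(100, 5) for _ in range(500)]
group2 = [random.gauss(300, 25) for _ in range(500)]

sd1 = statistics.stdev(group1)
sd2 = statistics.stdev(group2)

# varIdent-style parameterization: one reference SD plus a ratio per stratum.
ratio = sd2 / sd1
print(sd1, ratio)  # roughly 5 and roughly 5, recovering the simulated values
```

Giving each group its own σj is exactly this: a reference residual SD for one group plus a multiplier for each of the others.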

Consider a regression model with a single categorical predictor (group) with 2 levels (Group 1 and Group 2). If the residual spread differs across levels of the predictor (the two groups), then you could run the following R code to calculate a model with the modified error term:

fit2 <- gls(y~group, data=catData, weights=varIdent(form=~1|group))

> summary(fit2)
Generalized least squares fit by REML
  Model: y ~ group 
  Data: catData 
       AIC      BIC    logLik
  13627.53 13647.15 -6809.764

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | group 
 Parameter estimates:
 Group 1  Group 2 
1.000000 4.953512 

Coefficients:
                Value Std.Error   t-value p-value
(Intercept)  100.4819  4.441059 22.625669       0
groupGroup 2 197.9600 22.442637  8.820712       0

 Correlation: 
             (Intr)
groupGroup 2 -0.198

Standardized residuals:
         Min           Q1          Med           Q3          Max 
-2.926579282 -0.666993810  0.003494385  0.639112804  2.891051620 

Residual standard error: 99.30511 
Degrees of freedom: 1000 total; 998 residual

Once again, you can interpret this output like you would with any normal regression. The only thing that is new is under “Variance function,” which has a parameter estimate for each level of our categorical predictor: 1 and 4.95. When you multiply these values by the residual standard error (99.31), you get the estimated residual standard deviation for each of your groups (Group 1 = 99.31*1 = 99.31; Group 2 = 99.31*4.95 ≈ 491.6).
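This back-calculation is easy to check directly (a quick Python check; note that 99.31 is the residual standard error, so the products are residual standard deviations, and using the unrounded estimates moves Group 2 to about 491.9):

```python
# Numbers taken from the gls() output above: the residual standard error
# and the varIdent variance-function parameter estimates (SD ratios).
resid_se = 99.30511
ratios = {"Group 1": 1.000000, "Group 2": 4.953512}

# Multiplying the reference residual SD by each group's ratio gives that
# group's estimated residual standard deviation.
group_sds = {g: resid_se * r for g, r in ratios.items()}
print(group_sds)  # Group 1: about 99.31, Group 2: about 491.91
```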

If you want to learn more about modeling non-constant variance structures, then I highly recommend Zuur et al. (2009), which can be purchased on Amazon. Although the book mentions ecology in the title, you don’t actually need to know anything about ecology to understand its material.

Reference

  1. Zuur, A., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed effects models and extensions in ecology with R. New York, NY: Springer.

How to Merge Text Files by Rows in Python

Let’s say that you have 1000 text files that contain the same columns. For example:

File 1:

dataset  age    x        y
1        43     0.937    0.266
1        40     0.007    1.954
1        35     0.596    0.285
1        38     0.387    1.488
1        41     0.785    1.719

File 2:

dataset  age    x        y
2        55     1.243    0.104
2        23     0.900    2.093
2        32     0.743    0.001
2        56     1.213    0.754
2        39     0.842    0.342

. . .

File 1000:

dataset    age    x        y
1000       93     0.012    0.234
1000       34     0.453    0.032
1000       24     0.232    1.043
1000       56     0.123    0.343
1000       98     1.293    0.123

And let’s say that you wanted to merge all of these files by rows; that is, place them on top of each other. How would you do this? One way is with Python (we used version 2.7.8):

import glob

output_file = open("merged_file.txt", "w")
# Exclude the output file itself, which also matches the "*.txt" pattern,
# and sort for a predictable merge order.
file_list = sorted(f for f in glob.glob("*.txt") if f != "merged_file.txt")
first_file = open(file_list[0], "r")
output_file.write(first_file.readlines()[0])
first_file.close()
for file_name in file_list:
    data_file = open(file_name, "r")
    file_rows = data_file.readlines()
    for row in file_rows[1:]:
        output_file.write(row)
    data_file.close()
output_file.close()

Running this code produces a text file that contains the merged data (merged_file.txt). Going with our example, the merged file would look like this (with files 3 through 999 excluded because of space constraints):

dataset  age    x        y
1        43     0.937    0.266
1        40     0.007    1.954
1        35     0.596    0.285
1        38     0.387    1.488
1        41     0.785    1.719
2        55     1.243    0.104
2        23     0.900    2.093
2        32     0.743    0.001
2        56     1.213    0.754
2        39     0.842    0.342
. . .
1000       93     0.012    0.234
1000       34     0.453    0.032
1000       24     0.232    1.043
1000       56     0.123    0.343
1000       98     1.293    0.123

For the program to work, it needs to be placed in the same folder that contains the text files to be merged. Here’s how the program works:

  1. Creates the file that will contain all of the merged data.
    output_file = open("merged_file.txt", "w")
  2. Creates a sorted list of all of the text files that share the same folder as the program, excluding the output file itself (which would otherwise match the "*.txt" pattern and be merged into its own output).
    file_list = sorted(f for f in glob.glob("*.txt") if f != "merged_file.txt")
  3. Opens the first file in the list, reads its lines into a list, and writes the first row (the row containing the column headings) to the file that will contain all of the merged data. It then closes the first file.
    first_file = open(file_list[0], "r")
    output_file.write(first_file.readlines()[0])
    first_file.close()
  4. Iterates through each file in the file list and does the following:
    1. Opens the file and reads its lines into a list, with each line of the file being an element of the list.
      data_file = open(file_name, "r")
      file_rows = data_file.readlines()
    2. Copies all of the rows to the merged file except for the first (because we don’t want to copy the header row again).
      for row in file_rows[1:]:
          output_file.write(row)
    3. Closes the file.
      data_file.close()
  5. Closes the merged file so that all of the data is flushed to disk.
    output_file.close()

How Much Can a Sample Effect Size Deviate From the Population Effect Size?

You just completed a research experiment, did the data analysis, and found that you have a large effect size that is significant at the .05 level. “This is great!” you say to yourself as you start to prepare a submission of your findings to Science. Is it possible, however, that the true effect size (that is, the effect size in the population) is different from the effect size you calculated for your sample? “Sure, but it couldn’t be too different,” you may say. Think again. The plot below shows that a sample effect size may deviate substantially from the population effect size. In fact, you may find a very large effect in your data when there is actually no effect in the population.

[Figure: sample effect size as a function of population effect size and sample size]

The plot was made by simulating 80,000 datasets containing two groups that had either 10 or 30 observations each and had a population standardized mean difference (Cohen’s d) of either 0, 0.20, 0.50, or 0.80. For each dataset simulated, an effect size was calculated (sample effect size), and these effect sizes were plotted against the population effect sizes. Each point in the plot represents the effect size calculated for a single dataset.
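The heart of that simulation can be sketched as follows (a simplified Python stand-in, standard library only, for the R code linked at the end of the post): draw two groups with a known population d, compute the sample d, and repeat many times.

```python
import math
import random
import statistics

random.seed(4)

def sample_d(pop_d, n_per_group):
    """Draw two groups with population Cohen's d = pop_d; return the sample d."""
    g1 = [random.gauss(0, 1) for _ in range(n_per_group)]
    g2 = [random.gauss(pop_d, 1) for _ in range(n_per_group)]
    # Pooled standard deviation (equal group sizes).
    pooled = math.sqrt((statistics.variance(g1) + statistics.variance(g2)) / 2)
    return (statistics.mean(g2) - statistics.mean(g1)) / pooled

reps = 2000
d_small = [sample_d(0.0, 10) for _ in range(reps)]  # true d = 0, n = 10 per group
d_large = [sample_d(0.0, 30) for _ in range(reps)]  # true d = 0, n = 30 per group

# Even with no population effect, small samples can produce "large" effects.
print(max(abs(d) for d in d_small))  # can exceed 0.8
# The scatter of sample d shrinks as the per-group sample size grows.
print(statistics.stdev(d_small), statistics.stdev(d_large))
```

With no effect at all in the population, some of the n = 10 samples still produce a conventionally “large” d, and the spread of sample effect sizes is visibly narrower at n = 30, which is what the plot shows.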

This plot demonstrates two important points. First, it shows that a sample effect size can deviate substantially from the true effect size in the population. This is why you should always provide confidence intervals when presenting any findings: confidence intervals tell you which values of the population effect size are plausible given your data. Second, it shows that a sample effect size’s deviation from the population effect size tends to shrink as the sample size grows. This is why it’s almost always a good idea to use large sample sizes: the larger the sample size, the more precise your estimate of the population effect size.

For those interested, here’s the R code that was used to both simulate the data and create the plot: effect-size-r-code.txt. Be warned, though, that running the code may take some time!


Simulating Data for a Regression Model in R

Simulating data for use in a linear regression model is fairly straightforward in R because of how the language’s vector operations work.

For instance, to simulate 1000 observations for the following population model:

Yi = 5 + 0.03X1i + 0.05X2i + εi
εi ~ N(0, 0.0625)

We could run the following R code:

n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
# rnorm() takes a standard deviation, so sd = 0.25 gives variance 0.0625
y <- 5 + 0.03*x1 + 0.05*x2 + rnorm(n, mean=0, sd=0.25)

Notice that to calculate y, we only had to write out our regression model. If we fit a regression to our simulated data with lm(y ~ x1 + x2), we should see something similar to the following:

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.71103 -0.17190 -0.00386  0.17086  0.84411 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.010316   0.007763 645.370  < 2e-16 ***
x1          0.038880   0.007751   5.016 6.24e-07 ***
x2          0.049713   0.007548   6.586 7.28e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2455 on 997 degrees of freedom
Multiple R-squared:  0.06377,	Adjusted R-squared:  0.06189 
F-statistic: 33.95 on 2 and 997 DF,  p-value: 5.436e-15

Notice that the coefficients are close to those in our population model, and the residual standard error (0.2455) is close to the population residual standard deviation (√0.0625 = 0.25).


© AcuPsy 2015       info@acupsy.com