Monthly Archives: March 2015

A Simple Way of Dealing with Violations of the Homogeneity of Variance Assumption

The Problem

You just fit a regression model and have started assessing the residuals to determine whether they meet two of the assumptions of linear regression; that is, whether they’re normally distributed and have a constant error variance. Much to your dismay, you find that your residuals have a NON-constant error variance. In other words, your residual plot looks similar to one of the ones shown below. Oh no!

[Figure: two residual plots showing non-constant error variance: residual spread differing between groups in the first panel, and residual spread increasing with a predictor in the second panel]

Why is this a problem? First, the standard errors for your predictor coefficients may now be incorrect. Second, your t and F statistics may no longer follow their proper distributions. In other words, you won’t be able to properly assess statistical significance (Zuur et al., 2009). Finally, any data you simulate from your model will be incorrect.

Where Does this Assumption Come From?

Below is the typical regression model:

Yi = β0 + β1X1i + β2X2i + … + βpXpi + εi
εi ~ N(0, σ²)

The homogeneity of variance assumption is represented by εi ~ N(0, σ²). Basically, this is saying that each residual should come from a distribution with a mean of 0 and a variance of σ². If this is true, then we should see a constant residual spread across the fitted values. If this is not true, then we would see differences in the residual spread across the fitted values. For example, we could see something like what is shown in the two plots above.
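To make this concrete, here is a minimal, hypothetical R sketch (not the data behind the plots above) in which the error standard deviation grows with the predictor; the residual plot from an ordinary lm() fit then fans out across the fitted values:

# Hypothetical data: the error spread grows with x, violating homogeneity of variance
set.seed(123)
n <- 1000
x <- runif(n, 1, 100)
y <- 100 + 1*x + rnorm(n, mean=0, sd=0.10*x)  # sd increases with x

ols_fit <- lm(y ~ x)
plot(fitted(ols_fit), resid(ols_fit),
     xlab="Fitted values", ylab="Residuals")  # fan-shaped spread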

A Simple Solution

A simple solution would be to change your model so that it no longer assumes homogeneity of variance. This may sound difficult to do, but it’s actually very easy (at least in R). To do it, you’ll need to use the gls() function from the nlme library.

If your residual variance increases as a function of one of your predictors (like what’s shown in the second panel above), you could change the εi ~ N(0, σ²) in your model to εi ~ N(0, σ²Xi), where X is the predictor that the residual variance changes with.

For example, if you have a regression model with a single continuous predictor and your residual variance increases as values of your predictor increase, then you could run the following R code to calculate a model with the changed error term:

library(nlme)
fit1 <- gls(y~x, data=contData, weights=varFixed(~x))

> summary(fit1)
Generalized least squares fit by REML
  Model: y ~ x 
  Data: contData 
       AIC      BIC    logLik
  9515.935 9530.652 -4754.967

Variance function:
 Structure: fixed weights
 Formula: ~x 

Coefficients:
               Value Std.Error  t-value p-value
(Intercept) 98.99807 1.3284134  74.5235       0
x            1.00176 0.0026904 372.3427       0

 Correlation: 
  (Intr)
x -0.822

Standardized residuals:
        Min          Q1         Med          Q3         Max 
-3.33458441 -0.64276471 -0.04711188  0.64779512  3.67290659 

Residual standard error: 1.187176 
Degrees of freedom: 1000 total; 998 residual

How do you interpret this output? Well, you interpret it just like you would if it were output from a typical regression. Easy!
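If you want to check whether the added variance structure is actually doing anything for you, one option is to compare the weighted fit against an ordinary constant-variance fit. This is just a sketch that assumes the same contData object used above:

# Constant-variance fit for comparison (gls() with no weights argument)
fit0 <- gls(y~x, data=contData)
AIC(fit0, fit1)  # a lower AIC favors that variance structure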

If your residual variance changes as a function of a categorical predictor (like what’s shown in the first panel above), you could include this in your model as well by changing εi ~ N(0, σ²) to εi ~ N(0, σ²j), where j represents each level of the categorical predictor. Basically, we’re giving each level of our categorical predictor its own residual variance.

Consider a regression model with a single categorical predictor (group) with 2 levels (Group 1 and Group 2). If the residual spread differs across levels of the predictor (the two groups), then you could run the following R code to calculate a model with the modified error term:

fit2 <- gls(y~group, data=catData, weights=varIdent(form=~1|group))

> summary(fit2)
Generalized least squares fit by REML
  Model: y ~ group 
  Data: catData 
       AIC      BIC    logLik
  13627.53 13647.15 -6809.764

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | group 
 Parameter estimates:
 Group 1  Group 2 
1.000000 4.953512 

Coefficients:
                Value Std.Error   t-value p-value
(Intercept)  100.4819  4.441059 22.625669       0
groupGroup 2 197.9600 22.442637  8.820712       0

 Correlation: 
             (Intr)
groupGroup 2 -0.198

Standardized residuals:
         Min           Q1          Med           Q3          Max 
-2.926579282 -0.666993810  0.003494385  0.639112804  2.891051620 

Residual standard error: 99.30511 
Degrees of freedom: 1000 total; 998 residual

Once again, you can interpret this output like you would with any normal regression. The only thing that is new is under “Variance function,” which has a number below it for each level of our categorical predictor: 1 and 4.95. When you multiply these values by the residual standard error (99.31), you’ll get the estimated residual standard deviation for each of your groups (Group 1 = 99.31*1 = 99.31; Group 2 = 99.31*4.95 ≈ 491.58).
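As a quick check, you can reproduce those per-group values in R. This is a sketch that assumes the fit2 object from above; fit2$sigma holds the residual standard error:

# Residual standard error times each stratum's ratio gives the per-group
# residual standard deviations (Group 1 is the reference level, so its ratio is 1)
fit2$sigma * c(1, 4.953512)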

If you want to learn more about modeling non-constant variance structures, then I highly recommend Zuur et al. (2009), which can be purchased on Amazon. Although the book mentions ecology in the title, you don’t actually need to know anything about ecology to understand its material.

Reference

  1. Zuur, A., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed effects models and extensions in ecology with R. New York, NY: Springer.

Boxplot With Overlaid Scatterplots in ggplot2: A Hack for Use with a Fill Variable

Sometimes I feel like a boxplot alone or a scatterplot alone isn’t sufficient to accurately convey what’s interesting or unique about my data set. Boxplots have clear advantages over simple bar graphs for displaying data that may not be normally distributed or for showing that variability differs between groups or levels of a factor. At the same time, although scatterplots are not typically used when there is one categorical and one continuous variable, they can be very useful in displaying unique distributions when used in combination with a boxplot.

The data set I am using in this example is from a driving simulator task where participants pressed a button on the steering wheel to indicate times when they thought it would be safe to turn left across traffic. Gaps in traffic varied from being too small to safely turn within to being so large that even the most conservative driver would judge it safe to turn. The dependent measure was the distance, in feet, from the driver to the oncoming vehicle they would have turned in front of while executing the turn. What was of interest was first, whether drivers of different ages differed in their tendency to make risky left-turn decisions, and second, whether drivers of different ages differed in how long they waited to turn once they’d decided it was safe to go. Therefore, BOTH the gap size during which the participant said it was safe to turn AND the actual distance of the oncoming vehicle the driver would have turned in front of are relevant. If drivers are waiting a long time to initiate a turn after deciding it is safe to go, even if they are turning during an 800 ft gap in traffic (which should be more than safe), they could still be initiating the turn so late that they end up turning when the oncoming vehicle is too close.

For those of you who would like to follow along with this example, here is the data file I am using: data_example.txt, and here is my commented R code: boxplot_w_scatterplot_R.txt

First, we’ll need to load the ggplot2 library by typing the following, or running this line from the script window if you are using RStudio:

library(ggplot2)

Next, you will need to set your working directory to the folder in which you have placed the data file by running this line (remember to update the path to whatever directory you are going to use):

setwd("C:\\Users\\Ainsley\\Documents\\")

Alternatively, if you are already using a working directory you like, you can skip the working directory step and place the data file in the folder that is your current working directory.

To load the data file, assuming that it has been placed in the folder that is your current working directory:

data <- read.delim("data_example.txt",header=T,sep="\t")

Once the data file has been loaded, it is good practice to order your factor levels so that they appear in the same order you want them to appear in your graph. In the current data set, I already know that the default ordering of the dist_cat and age_group factors isn’t what I want. The following code will reorder those factors:

data$dist_cat <- factor(data$dist_cat,levels=c("Risky turn","Safe turn", "Cautious turn"))
data$age_group <- factor(data$age_group,levels=c("Younger","Older"))

Now I am ready to generate a boxplot. Below is the code for the pictured boxplot. Only the first two lines are needed to get a basic boxplot; the remaining lines change the x and y axis labels and modify the axis label and title text size, style, and color.

ggplot(data, aes(dist_cat,onc_dist)) +
    geom_boxplot() +
    xlab("Gap Size Category") +
    ylab("Oncoming Vehicle Distance") +
    theme(axis.title.x = element_text(face='bold',size=16,hjust=0.5),
          axis.title.y = element_text(face='bold',size=16,vjust=1),
          axis.text.x = element_text(face='bold',size=14,color='black'),
          axis.text.y = element_text(face='bold',size=14,color='black'))

[Figure: boxplot of oncoming vehicle distance by gap size category]

Here, for the sake of the example, I have collapsed the continuous variable “gap size” into a categorical variable, “gap size category” (risky turn, safe turn, cautious turn). Although it is not necessarily proper to do this, please bear with me.

The horizontal line in each box shows the median oncoming vehicle distance for all responses in that gap size category. Not surprisingly, oncoming vehicles tend to be closer when drivers are turning within a smaller gap in traffic; in fact, it would be a concern if this were not the case! However, we can also see that there are some differences in the variability of oncoming car distance between gap size categories, as well as some outliers in the safe and risky turn categories.

I’ll admit it, sometimes I like to be fancy. It occurred to me that I could make use of ggplot2’s ability to generate multi-level plots and overlay a cute scatterplot on my boxplots. That way someone looking at my graph can see not only important summary information (e.g. group median, interquartile range) but could also get a feel for the distribution of values.

To add plotted points over the boxplot, we need only add the geom_jitter command following geom_boxplot. Also, notice that I have added the argument “outlier.size=0” to the geom_boxplot command. This is to get rid of the default outlier points from the boxplot layer because those points will be plotted by the geom_jitter command. No need to plot them twice! To avoid overplotting in the geom_jitter layer, notice that I am applying jitter to the points, which randomly varies the x position of the points around some value. However, because I want the y values to be accurate, and they already vary enough on their own, I am not applying jitter to the height of the plotted points. In addition, I set the size of the points, set the alpha level so that the plotted points show up slightly lighter, and remove the point layer from the legend with “show_guide=FALSE”.

ggplot(data, aes(dist_cat, onc_dist)) +
    geom_boxplot(outlier.size=0) +
    geom_jitter(aes(dist_cat,onc_dist),
               position=position_jitter(width=0.1,height=0),
               alpha=0.6,
               size=3,
               show_guide=FALSE) +
    xlab("Gap Size Category") +
    ylab("Oncoming Vehicle Distance") +
    theme(axis.title.x = element_text(face='bold',size=16,hjust=0.5),
          axis.title.y = element_text(face='bold',size=16,vjust=1),
          axis.text.x = element_text(face='bold',size=14,color='black'),
          axis.text.y = element_text(face='bold',size=14,color='black'))

[Figure: boxplot with jittered data points overlaid]

Recall that participants in this data set varied in age. Another question in that study was whether older and younger drivers differed in either the size of the gap in traffic during which they were comfortable turning or how long it took them to initiate a turn once they decided to go. Considering this, it would be cool if we could add age group (18 to 35 years vs. 65 years and older) to the plot. We can do that by adding the column age_group as a fill variable in the main ggplot() aes, which the boxplot layer inherits. Also, because I do want age group shown in the legend, but I don’t want to use the exact column name as the legend title, I add scale_fill_discrete with name="Age Group" to adjust the legend title. I’ll also add lines to the “theme” section of my code to adjust the size and color of the legend title and text.

ggplot(data, aes(dist_cat, onc_dist, fill=age_group)) +
     geom_boxplot(outlier.size=0) +
     geom_jitter(aes(dist_cat,onc_dist),
         position=position_jitter(width=0.1,height=0),
         alpha=0.6,
         size=3,
         show_guide=FALSE) +
     xlab("Gap Size Category") +
     ylab("Oncoming Vehicle Distance") +
     scale_fill_discrete(name="Age Group") +
     theme(axis.title.x = element_text(face='bold',size=16,hjust=0.5),
        axis.title.y = element_text(face='bold',size=16,vjust=1),
        axis.text.x = element_text(face='bold',size=14,color='black'),
        axis.text.y = element_text(face='bold',size=14,color='black'),
        legend.title = element_text(face="bold", color="black", size=14),
        legend.text = element_text(face="bold", color="black", size=12))

[Figure: filled boxplots by age group, with the jittered points no longer lined up over the correct boxes]

Oh no! My pretty, pretty plot is ruined! What do I do now? I still want to have age group included, but I also want the points plotted over my boxplot. What to do?

Here’s where I can save you some time. I don’t know if this is a GOOD solution to the problem, but it is a workable solution. If someone who reads this has a better solution, please let me know.

I’m not going to type out all the things I did that did not produce the desired result. Instead, let’s just look at what worked for me. First, what I wanted to do was apply an adjustment to the points so that the older and younger adults’ values would be spread apart and sit over the correct part of the boxplot. Unfortunately, although ggplot2 treats categorical values like numbers in some ways (e.g., when you are telling it where to put an annotation), it doesn’t treat them like numbers in all ways (e.g., you cannot do arithmetic on them).

You can probably see where this is going. The solution is to recode the x-axis values so that they are numeric, then apply a different adjustment value for each age group.

I don’t want to replace values in my dist_cat column, so I recode those into a new column called dist_cat_n:

data$dist_cat_n[data$dist_cat == "Risky turn"] <- 1
data$dist_cat_n[data$dist_cat == "Safe turn"] <- 2
data$dist_cat_n[data$dist_cat == "Cautious turn"] <- 3

Next, I create a new column with my adjustment values called scat_adj. I’m assigning the negative value to younger drivers because they are going to be on the left side, and older adults get the positive value so they will appear on the right:

data$scat_adj[data$age_group == "Younger"] <- -0.20
data$scat_adj[data$age_group == "Older"] <- 0.20

Now, I need to modify my graph code to use my new columns in the geom_jitter layer. Because the dist_cat_n column is now numeric, ggplot2 is cool with applying the adjustment value for me.

ggplot(data, aes(dist_cat,onc_dist,fill=age_group)) +
    geom_boxplot(outlier.size=0) +
    geom_jitter(aes(dist_cat_n + scat_adj,onc_dist),
        position=position_jitter(width=0.1,height=0),
        alpha=0.6,
        size=3,
        show_guide=FALSE) +
    scale_fill_discrete(name="Age Group") +
    xlab("Gap Size Category") +
    ylab("Median Oncoming Vehicle Distance") +
    theme(axis.title.x = element_text(face='bold',size=16,hjust=0.5),
        axis.title.y = element_text(face='bold',size=16,vjust=1),
        axis.text.x = element_text(face='bold',size=14,color='black'),
        axis.text.y = element_text(face='bold',size=14,color='black'),
        legend.title = element_text(face="bold", color="black", size=14),
        legend.text = element_text(face="bold", color="black", size=12))

[Figure: filled boxplots by age group, with the jittered points positioned over the correct boxes]

Voila! Now the graph appears as I intended. I can have my plotted points and display them, too.
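For what it’s worth, depending on your version of ggplot2, there may be a built-in alternative to the recoding hack: position_jitterdodge() both dodges points by the fill variable and jitters them. The sketch below is untested against this exact data set, so treat it as a pointer rather than a drop-in replacement:

ggplot(data, aes(dist_cat, onc_dist, fill=age_group)) +
    geom_boxplot(outlier.size=0) +
    geom_point(position=position_jitterdodge(jitter.width=0.1, jitter.height=0),
               alpha=0.6,
               size=3) +
    scale_fill_discrete(name="Age Group") +
    xlab("Gap Size Category") +
    ylab("Oncoming Vehicle Distance")

As before, you may also want to suppress the point layer’s legend (show_guide in older releases, show.legend in newer ones).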


How to Merge Text Files by Rows in Python

Let’s say that you have 1000 text files that contain the same columns. For example:

File 1:

dataset  age    x        y
1        43     0.937    0.266
1        40     0.007    1.954
1        35     0.596    0.285
1        38     0.387    1.488
1        41     0.785    1.719

File 2:

dataset  age    x        y
2        55     1.243    0.104
2        23     0.900    2.093
2        32     0.743    0.001
2        56     1.213    0.754
2        39     0.842    0.342

. . .

File 1000:

dataset    age    x        y
1000       93     0.012    0.234
1000       34     0.453    0.032
1000       24     0.232    1.043
1000       56     0.123    0.343
1000       98     1.293    0.123

And let’s say that you wanted to merge all of these files by rows; that is, place them on top of each other. How would you do this? One way is with Python (we used version 2.7.8):

import glob

output_file = open("merged_file.txt", "w")
# Exclude the output file itself so its contents aren't merged back into it
file_list = [f for f in glob.glob("*.txt") if f != "merged_file.txt"]
first_file = open(file_list[0], "r")
output_file.write(first_file.readlines()[0])
first_file.close()
for file_name in file_list:
    current_file = open(file_name, "r")
    file_rows = current_file.readlines()
    for row in file_rows[1:]:
        output_file.write(row)
    current_file.close()
output_file.close()

Running this code returns a text file that contains the merged data (merged_file.txt). Going with our example, the merged file would look like this (with files 3 through 999 excluded because of space constraints):

dataset  age    x        y
1        43     0.937    0.266
1        40     0.007    1.954
1        35     0.596    0.285
1        38     0.387    1.488
1        41     0.785    1.719
2        55     1.243    0.104
2        23     0.900    2.093
2        32     0.743    0.001
2        56     1.213    0.754
2        39     0.842    0.342
. . .
1000       93     0.012    0.234
1000       34     0.453    0.032
1000       24     0.232    1.043
1000       56     0.123    0.343
1000       98     1.293    0.123

For the program to work, it needs to be placed in the same folder that contains the text files to be merged. Here’s how the program works:

  1. Creates a file that will contain all of the merged data.
    output_file = open("merged_file.txt", "w")
  2. Creates a list of all of the text files that share the same folder as the program, excluding the output file itself (so it isn’t merged into its own contents).
    file_list = [f for f in glob.glob("*.txt") if f != "merged_file.txt"]
  3. Opens the first file in the list, converts it to a list with each element of the list being a line in the file, and then writes the first row of the file (the row containing the column headings) to the file that will contain all of the merged data. It then closes the first file.
    first_file = open(file_list[0], "r")
    output_file.write(first_file.readlines()[0])
    first_file.close()
  4. Iterates through each file in the file list and does the following:
    1. Opens the file and converts it to a list with each line of the file being an element in the list.
      current_file = open(file_name, "r")
      file_rows = current_file.readlines()
    2. Copies all of the rows to the merged file except for the first (because we don’t want to copy the header row again).
      for row in file_rows[1:]:
          output_file.write(row)
    3. Closes the file.
      current_file.close()
  5. Closes the merged file once every input file has been processed.
    output_file.close()

How Much Can a Sample Effect Size Deviate From the Population Effect Size?

You just completed a research experiment, did the data analysis, and found that you have a large effect size that is significant at the .05 level. “This is great!” you say to yourself as you start to prepare a submission of your findings for Science. Is it possible, however, that the true effect size (that is, the effect size in the population) is different from the effect size you calculated for your sample? “Sure, but it couldn’t be too different,” you may say. Think again. The plot below shows that a sample effect size may deviate substantially from the population effect size. In fact, you may find a very large effect in your data when there is actually no effect in the population.

[Figure: sample effect size as a function of population effect size and sample size]

The plot was made by simulating 80,000 datasets containing two groups that had either 10 or 30 observations each and had a population standardized mean difference (Cohen’s d) of either 0, 0.20, 0.50, or 0.80. For each dataset simulated, an effect size was calculated (sample effect size), and these effect sizes were plotted against the population effect sizes. Each point in the plot represents the effect size calculated for a single dataset.
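For a rough sense of how such a simulation works, here is a minimal R sketch (not the actual script linked below) that draws two groups from normal distributions and computes the sample Cohen’s d:

# Minimal sketch: sample Cohen's d values for one group size and one population d
sim_d <- function(n_per_group, pop_d) {
  g1 <- rnorm(n_per_group, mean=0, sd=1)
  g2 <- rnorm(n_per_group, mean=pop_d, sd=1)
  pooled_sd <- sqrt((var(g1) + var(g2)) / 2)  # equal group sizes, so average the variances
  (mean(g2) - mean(g1)) / pooled_sd
}

set.seed(1)
sample_d <- replicate(10000, sim_d(n_per_group=10, pop_d=0.50))
range(sample_d)  # individual sample d values stray far from 0.50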

This plot demonstrates two important points. First, it shows that a sample effect size can deviate substantially from the true effect size in the population. This is why you should always provide confidence intervals when presenting any findings: confidence intervals tell you which values of the population effect size are plausible. Second, it shows that a sample effect size’s deviation from the population effect size tends to shrink as the sample size grows. This is why it’s almost always a good idea to use large sample sizes: the larger the sample size, the more precise your estimate of the population effect size.
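If you want to report such an interval, one common large-sample approximation for a confidence interval around a sample Cohen’s d looks like the sketch below (specialized packages provide more exact intervals based on the noncentral t distribution):

# Approximate 95% CI for Cohen's d using a common large-sample variance formula
d_ci <- function(d, n1, n2) {
  se <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))
  c(lower = d - 1.96*se, upper = d + 1.96*se)
}
d_ci(d=0.80, n1=10, n2=10)  # the interval is wide with small groups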

For those interested, here’s the R code that was used to both simulate the data and create the plot: effect-size-r-code.txt. Be warned, though, that running the code may take some time!


Simulating Data for a Regression Model in R

Simulating data for use in a linear regression model is fairly straightforward in R because of how the language’s vector operations work.

For instance, to simulate 1000 observations for the following population model:

Yi = 5 + 0.03X1i + 0.05X2i + εi
εi ~ N(0, 0.25²)

We could run the following R code:

n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 5 + 0.03*x1 + 0.05*x2 + rnorm(n, mean=0, sd=0.25)

Notice that to calculate y, we only had to enter our regression model. If we then fit a regression to the simulated data with summary(lm(y ~ x1 + x2)), we should see something similar to the following:

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.71103 -0.17190 -0.00386  0.17086  0.84411 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.010316   0.007763 645.370  < 2e-16 ***
x1          0.038880   0.007751   5.016 6.24e-07 ***
x2          0.049713   0.007548   6.586 7.28e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2455 on 997 degrees of freedom
Multiple R-squared:  0.06377,	Adjusted R-squared:  0.06189 
F-statistic: 33.95 on 2 and 997 DF,  p-value: 5.436e-15

Notice that the coefficients and the residual standard error are close to those in our population model.
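As a quick follow-up, you can also check that confidence intervals from the fitted model cover the population values. This sketch assumes the same simulated x1, x2, and y from above:

fit <- lm(y ~ x1 + x2)
confint(fit)  # the intervals should usually cover 5, 0.03, and 0.05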

