Boxplot With Overlaid Scatterplots in ggplot2: A Hack for Use with a Fill Variable

Sometimes I feel like only boxplots or only scatterplots aren’t sufficient to accurately convey what’s interesting or unique about my data set. Boxplots have clear advantages over simple bar graphs for displaying data that may not be normally distributed or in showing that variability differs between groups or levels of a factor. At the same time, although scatterplots are not typically used when there is one categorical and one continuous variable, they can be very useful in displaying unique distributions when used in combination with a boxplot.

The data set I am using in this example is from a driving simulator task where participants pressed a button on the steering wheel to indicate times when they thought it would be safe to turn left across traffic. Gaps in traffic varied from being too small to safely turn within to being so large that even the most conservative driver would judge it safe to turn. The dependent measure was the distance, in feet, from the driver to the oncoming vehicle they would have turned in front of while executing the turn. What was of interest was first, whether drivers of different ages differed in their tendency to make risky left-turn decisions, and second, whether drivers of different ages differed in how long they waited to turn once they’d decided it was safe to go. Therefore, BOTH the gap size during which the participant said it was safe to turn AND the actual distance of the oncoming vehicle the driver would have turned in front of are relevant. If drivers are waiting a long time to initiate a turn after deciding it is safe to go, even if they are turning during an 800 ft gap in traffic (which should be more than safe), they could still be initiating the turn so late that they end up turning when the oncoming vehicle is too close.

For those of you who would like to follow along with this example, here is the data file I am using: data_example.txt and here is my commented R code: boxplot_w_scatterplot_R.txt

First, we’ll need to load the ggplot2 library by typing the following, or running this line from the script window if you are using RStudio:

library(ggplot2)

Next, you will need to set your working directory to the folder in which you have placed the data file by running this line (remember to update the path to whatever directory you are going to use):

setwd("C:\\Users\\Ainsley\\Documents\\")

Alternatively, if you are already using a working directory you like, you can skip the working directory step and place the data file in the folder that is your current working directory.

To load the data file, assuming that it has been placed in the folder that is your current working directory:

data <- read.delim("data_example.txt", header=TRUE, sep="\t")
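If you want to see what read.delim does before pointing it at the real file, here is a minimal sketch that reads a tiny tab-delimited sample from a string instead. Only the three column names (dist_cat, age_group, onc_dist) come from this post; the values and the names sample_text and sample_data are invented for illustration.

```r
# Three made-up rows in the same tab-delimited layout (values are NOT from the study).
sample_text <- paste0(
  "dist_cat\tage_group\tonc_dist\n",
  "Risky turn\tYounger\t150\n",
  "Safe turn\tOlder\t420\n",
  "Cautious turn\tYounger\t760\n"
)

# The text= argument lets read.delim parse a string instead of a file on disk.
sample_data <- read.delim(text = sample_text, header = TRUE, sep = "\t")

# str() shows each column's type, which is worth checking before plotting.
str(sample_data)
```

In R 4.0.0 and later, text columns are read in as character vectors rather than factors by default, which is fine here because the factor() calls in the next step convert them anyway.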

Once the data file has been loaded, it is good practice to set the order of each factor’s levels so that they appear in the same order you want them to appear in your graph. In the current data set, I already know that the default ordering of the dist_cat and age_group factors isn’t what I want. The following code will reorder the levels of those factors:

data$dist_cat <- factor(data$dist_cat,levels=c("Risky turn","Safe turn", "Cautious turn"))
data$age_group <- factor(data$age_group,levels=c("Younger","Older"))
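To make the effect of the levels argument concrete, here is a small sketch on a toy vector (the vector gap and its values are made up, standing in for data$dist_cat). It also shows a pitfall worth checking for: a label that doesn't match a level exactly is silently turned into NA.

```r
# Toy vector standing in for data$dist_cat (values are made up).
gap <- c("Safe turn", "Risky turn", "Cautious turn", "Risky turn")

# Default: levels are sorted alphabetically.
levels(factor(gap))
# "Cautious turn" "Risky turn" "Safe turn"

# Explicit levels: the order given here is the order used on the axis.
gap_ordered <- factor(gap, levels = c("Risky turn", "Safe turn", "Cautious turn"))
levels(gap_ordered)
# "Risky turn" "Safe turn" "Cautious turn"

# A level spelled differently from the data ("Risky Turn") matches nothing,
# so every "Risky turn" value becomes NA.
gap_typo <- factor(gap, levels = c("Risky Turn", "Safe turn", "Cautious turn"))
sum(is.na(gap_typo))
# 2
```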

Now I am ready to generate a boxplot. Below is the code for the pictured boxplot. Only the first two lines are needed to get a basic boxplot; the remaining lines change the x and y axis labels and modify the size, style, and color of the axis title and tick label text.

ggplot(data, aes(dist_cat,onc_dist)) +
    geom_boxplot() +
    xlab("Gap Size Category") +
    ylab("Oncoming Vehicle Distance") +
    theme(axis.title.x = element_text(face='bold',size=16,hjust=0.5),
          axis.title.y = element_text(face='bold',size=16,vjust=1),
          axis.text.x = element_text(face='bold',size=14,color='black'),
          axis.text.y = element_text(face='bold',size=14,color='black'))

Boxplot with a scatterplot overlaid 1.
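Since the same theme() settings are repeated in every plot in this post, one option is to store them in an object once and add that object to each plot. The sketch below assumes a toy data frame with invented values; the names toy and axis_style are made up for the example.

```r
library(ggplot2)

# Store the axis styling once; `axis_style` is a hypothetical name.
axis_style <- theme(
  axis.title.x = element_text(face = "bold", size = 16, hjust = 0.5),
  axis.title.y = element_text(face = "bold", size = 16, vjust = 1),
  axis.text.x  = element_text(face = "bold", size = 14, color = "black"),
  axis.text.y  = element_text(face = "bold", size = 14, color = "black")
)

# Toy data with the same column names as the example file (values invented).
toy <- data.frame(
  dist_cat = factor(rep(c("Risky turn", "Safe turn", "Cautious turn"), each = 4),
                    levels = c("Risky turn", "Safe turn", "Cautious turn")),
  onc_dist = c(90, 120, 150, 400, 300, 340, 360, 380, 600, 640, 700, 720)
)

# The stored theme is added like any other plot component.
p <- ggplot(toy, aes(dist_cat, onc_dist)) +
  geom_boxplot() +
  xlab("Gap Size Category") +
  ylab("Oncoming Vehicle Distance") +
  axis_style
```

Printing p draws the plot. Changing a font size then means editing one place instead of every plot.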

Here, for the sake of the example, I have crunched the continuous variable “gap size” into a categorical value, “gap size category” (risky turn, safe turn, cautious turn). Although it is not necessarily proper to do this, please bear with me for the sake of the example.

The horizontal line inside each box shows the median oncoming vehicle distance for all responses in that gap size category. Not surprisingly, oncoming vehicles tend to be closer when drivers are turning within a smaller gap in traffic. In fact, it would be a concern if this were not the case! However, we can also see that there are some differences in the variability of oncoming car distance between gap size categories, as well as some outliers in the safe and risky turn categories.
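For anyone curious where the median line, the box edges, and the outlier points come from, here is a rough sketch of the arithmetic on a made-up vector of distances. It uses the common rule that points more than 1.5 times the interquartile range beyond the box edges are flagged as outliers, which is my understanding of what geom_boxplot does by default; the vector onc and the variable names are invented.

```r
# Made-up distances in feet, for illustration only.
onc <- c(120, 150, 160, 175, 180, 190, 205, 210, 230, 480)

# Box edges and median line: the 25th, 50th, and 75th percentiles.
q <- quantile(onc, c(0.25, 0.5, 0.75))
iqr <- unname(q[3] - q[1])  # interquartile range = height of the box

# Points beyond 1.5 * IQR from the box edges are drawn as outliers.
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr
outliers <- onc[onc < lower_fence | onc > upper_fence]

unname(q[2])  # median: 185
outliers      # 480
```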

I’ll admit it, sometimes I like to be fancy. It occurred to me that I could make use of ggplot2’s ability to generate multi-layer plots and overlay a cute scatterplot on my boxplots. That way someone looking at my graph can see not only important summary information (e.g. group median, interquartile range) but can also get a feel for the distribution of values.

To add plotted points over the boxplot, we need only add the geom_jitter command following geom_boxplot. Also, notice that I have added the argument “outlier.shape=NA” to the geom_boxplot command. This hides the default outlier points from the boxplot layer because those points will be plotted by the geom_jitter command. No need to plot them twice! To avoid overplotting in the geom_jitter layer, I apply jitter to the points, which randomly shifts each point’s x position by up to the given width (0.1 here) on either side of its category. However, because I want the y values to be accurate, and they already vary enough on their own, I am not applying jitter to the height of the plotted points. In addition, I set the size of the points, set the alpha level so that the plotted points are partly transparent, and remove the point layer from the legend with “show.legend=FALSE” (this argument was called show_guide in older versions of ggplot2).

ggplot(data, aes(dist_cat, onc_dist)) +
    geom_boxplot(outlier.shape=NA) +
    geom_jitter(aes(dist_cat,onc_dist),
               position=position_jitter(width=0.1,height=0),
               alpha=0.6,
               size=3,
               show.legend=FALSE) +
    xlab("Gap Size Category") +
    ylab("Oncoming Vehicle Distance") +
    theme(axis.title.x = element_text(face='bold',size=16,hjust=0.5),
          axis.title.y = element_text(face='bold',size=16,vjust=1),
          axis.text.x = element_text(face='bold',size=14,color='black'),
          axis.text.y = element_text(face='bold',size=14,color='black'))

Boxplot with a scatterplot overlaid 2.
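If you want to convince yourself that the jitter only moves points sideways, one way is to pull out the data that ggplot2 actually draws for the point layer with layer_data(). The sketch below uses a toy data frame with invented values, and it also passes a seed to position_jitter so the random shifts are reproducible (the seed argument is available in reasonably recent versions of ggplot2).

```r
library(ggplot2)

# Toy data with the same column names as the example file (values invented).
toy <- data.frame(
  dist_cat = factor(rep(c("Risky turn", "Safe turn", "Cautious turn"), each = 4),
                    levels = c("Risky turn", "Safe turn", "Cautious turn")),
  onc_dist = c(90, 120, 150, 400, 300, 340, 360, 380, 600, 640, 700, 720)
)

p <- ggplot(toy, aes(dist_cat, onc_dist)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(position = position_jitter(width = 0.1, height = 0, seed = 1),
              alpha = 0.6, size = 3)

# Layer 2 is the jittered point layer.
pts <- layer_data(p, 2)

# With height = 0 the y values are left exactly as they are in the data.
y_unchanged <- all(pts$y %in% toy$onc_dist)

# Each category sits at an integer x position (1, 2, 3); the jittered points
# should stay within 0.1 of the nearest one.
max_shift <- max(abs(as.numeric(pts$x) - round(as.numeric(pts$x))))
```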

Recall that participants in this data set varied in age. Another question in that study was whether older and younger drivers differed in either the size of the gap in traffic during which they were comfortable turning or how long it took them to initiate a turn once they decided to go. Considering this, it would be cool if we could add age group (18 to 35 years vs. 65 years and older) to the plot. We can do that by adding the column age_group as a fill value in the aes of the ggplot command. Also, because I do want age group shown in the legend, but I don’t want to use the exact column name as the legend title, I add scale_fill_discrete with name="Age Group" to adjust the legend title. I’ll also add lines to the “theme” section of my code to adjust the size and color of the legend title and text.

ggplot(data, aes(dist_cat, onc_dist, fill=age_group)) +
     geom_boxplot(outlier.shape=NA) +
     geom_jitter(aes(dist_cat,onc_dist),
         position=position_jitter(width=0.1,height=0),
         alpha=0.6,
         size=3,
         show.legend=FALSE) +
     xlab("Gap Size Category") +
     ylab("Oncoming Vehicle Distance") +
     scale_fill_discrete(name="Age Group") +
     theme(axis.title.x = element_text(face='bold',size=16,hjust=0.5),
        axis.title.y = element_text(face='bold',size=16,vjust=1),
        axis.text.x = element_text(face='bold',size=14,color='black'),
        axis.text.y = element_text(face='bold',size=14,color='black'),
        legend.title = element_text(face="bold", color="black", size=14),
        legend.text = element_text(face="bold", color="black", size=12))

Boxplot with a scatterplot overlaid 3.

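Oh no! My pretty, pretty plot is ruined! The boxes are now split by age group and sit side by side within each gap size category, but the points are still clustered over the center of each category instead of over their own box. I still want to have age group included, but I also want the points plotted over the correct boxplot. What to do?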

Here’s where I can save you some time. I don’t know if this is a GOOD solution to the problem, but it is a workable solution. If someone who reads this has a better solution, please let me know.

I’m not going to type out all the things I did that did not produce the desired result. Instead, let’s just look at what worked for me. What I wanted to do was apply an adjustment to the points so that the older and younger adults’ values would be spread apart and sit over the correct part of the boxplot. Unfortunately, although ggplot2 treats categorical values like numbers in some ways (e.g. when you are telling it where to put an annotation), it doesn’t treat these values like numbers in all ways (e.g. you cannot do arithmetic on them).

You can probably see where this is going. The solution is to recode the x-axis values so that they are numeric, then apply a different adjustment value for each age group.

I don’t want to replace values in my dist_cat column, so I recode those into a new column called dist_cat_n:

data$dist_cat_n[data$dist_cat == "Risky turn"] <- 1
data$dist_cat_n[data$dist_cat == "Safe turn"] <- 2
data$dist_cat_n[data$dist_cat == "Cautious turn"] <- 3
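As a possible shortcut, as.numeric() on a factor returns the position of each value in the factor's level order, so once the levels have been set as shown earlier, the same 1/2/3 coding can be produced in one line. The sketch below uses a small made-up factor rather than the real data, and it only gives the intended codes if the levels are already in the order you want.

```r
# Made-up factor with the same levels, in the same order, as data$dist_cat.
dist_cat <- factor(c("Safe turn", "Risky turn", "Cautious turn", "Risky turn"),
                   levels = c("Risky turn", "Safe turn", "Cautious turn"))

# Each value is replaced by the position of its level: Risky = 1, Safe = 2, Cautious = 3.
dist_cat_n <- as.numeric(dist_cat)
dist_cat_n
# 2 1 3 1
```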

Next, I create a new column with my adjustment values called scat_adj. I’m assigning the negative value to younger adults because they are going to be on the left side, and older adults get the positive value so they will appear on the right:

data$scat_adj[data$age_group == "Younger"] <- -0.20
data$scat_adj[data$age_group == "Older"] <- 0.20
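If you are wondering how far apart the offsets should be, here is a rough sketch of the arithmetic. It assumes that the boxes within one category share a total width of 0.75 x-axis units, split evenly between the fill groups, which is my understanding of the ggplot2 default and may not hold if you set the box width or dodge width yourself. The helper box_centres is hypothetical, written just for this example.

```r
# Assumption: the boxes in one category share a total width of 0.75 units,
# divided evenly among the groups. `box_centres` is a made-up helper.
box_centres <- function(n_groups, total_width = 0.75) {
  slot_width <- total_width / n_groups
  # Centre of each slot, measured from the middle of the category.
  (seq_len(n_groups) - (n_groups + 1) / 2) * slot_width
}

box_centres(2)
# -0.1875  0.1875
```

Under that assumption, offsets of -0.20 and 0.20 land the points very close to the middle of each box.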

Now, I need to modify my graph code to use my new columns in the geom_jitter layer. Because the dist_cat_n column is now numeric, ggplot2 is cool with applying the adjustment value for me.

ggplot(data, aes(dist_cat,onc_dist,fill=age_group)) +
    geom_boxplot(outlier.shape=NA) +
    geom_jitter(aes(dist_cat_n + scat_adj,onc_dist),
        position=position_jitter(width=0.1,height=0),
        alpha=0.6,
        size=3,
        show.legend=FALSE) +
    scale_fill_discrete(name="Age Group") +
    xlab("Gap Size Category") +
    ylab("Oncoming Vehicle Distance") +
    theme(axis.title.x = element_text(face='bold',size=16,hjust=0.5),
        axis.title.y = element_text(face='bold',size=16,vjust=1),
        axis.text.x = element_text(face='bold',size=14,color='black'),
        axis.text.y = element_text(face='bold',size=14,color='black'),
        legend.title = element_text(face="bold", color="black", size=14),
        legend.text = element_text(face="bold", color="black", size=12))

Boxplot with a scatterplot overlaid 4.

Voila! Now the graph appears as I intended. I can have my plotted points and display them, too.
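For readers looking for an alternative to the recoding hack, ggplot2 also provides position_jitterdodge(), which dodges points by the fill grouping and jitters them in one step. The sketch below is not the method used in this post, just one possible way to get a similar picture; it uses a toy data frame with invented values, and the dodge width of 0.75 is an assumption chosen to line up with the default box placement.

```r
library(ggplot2)

# Toy data with the same column names as the example file (values invented).
toy <- data.frame(
  dist_cat  = factor(rep(c("Risky turn", "Safe turn", "Cautious turn"), each = 4),
                     levels = c("Risky turn", "Safe turn", "Cautious turn")),
  age_group = factor(rep(c("Younger", "Older"), times = 6),
                     levels = c("Younger", "Older")),
  onc_dist  = c(90, 120, 150, 400, 300, 340, 360, 380, 600, 640, 700, 720)
)

p <- ggplot(toy, aes(dist_cat, onc_dist, fill = age_group)) +
  geom_boxplot(outlier.shape = NA) +
  # The fill mapping is inherited from ggplot(), which tells the position
  # adjustment which groups to separate; no numeric recoding is needed.
  geom_point(position = position_jitterdodge(jitter.width = 0.1,
                                             jitter.height = 0,
                                             dodge.width = 0.75),
             alpha = 0.6, size = 3, show.legend = FALSE) +
  scale_fill_discrete(name = "Age Group") +
  xlab("Gap Size Category") +
  ylab("Oncoming Vehicle Distance")

# Inspect where the points end up: younger drivers should sit left of the
# category position, older drivers to the right.
pts <- layer_data(p, 2)
```

The amount of sideways jitter may not match position_jitter(width=0.1) exactly, so it is worth adjusting jitter.width by eye.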

© AcuPsy 2015       info@acupsy.com