Monthly Archives: November 2016

Dual-axis graphs in R base graphics

In most graph-making situations, it is better to favor simplicity over an all-out display of your insane graphing skills. However, there are times when it may be useful to create graphs that include a second y-axis. Until recently, the capability to create dual-axis graphs wasn’t a built-in function in the ggplot2 package. A secondary axis could still be added through the use of some clever hacks (e.g. Dual axis function for ggplot2), but these solutions would be intimidating for most beginning R programmers.

The most recent release of the ggplot2 package (version 2.2.0; November 14, 2016: ggplot2 2.2.0) now includes the ability to add a secondary axis via the sec.axis argument to scale_x_continuous() or scale_y_continuous(). I haven’t had time to try the new secondary scale function myself, but previous experience with ggplot2 graphing is that this method of creating a secondary axis will be quick and simple to use, possibly even simpler than the way this is accomplished in R base plots.

Despite the awesomeness of ggplot2, it is likely that some will still continue to find R base graphics to be the best solution for their dual-axis graphing needs. As such, I thought a post about how to do this might still be of value. As a bonus, I will also briefly describe how to rescale variables. This is can be helpful if one of your variables has a very different value range from others shown on the same scale. Rescaling, when used with a secondary axis, can kind of let you “zoom in” on one variable without messing up the scale for the others.

NOTE: All of this code is intended to be run as one chunk, but I’m separating it into smaller snippets to make it easier to follow the explanation for each part.

Adjust your plot margins. Before you begin, you will want to adjust the plot margins. Otherwise, the axis label for the secondary y-axis will be cut off.

par(mar = c(5,5,2,5))

Looking at the arguments passed to mar, these set the bottom, left, top and right margins of the plot. The units are “lines of text”, which isn’t exactly intuitive to use. You can also use mai instead of mar to set the margins in inches, if that is your preference. Regardless of whether I use mar or mai, I typically end up doing some amount of trial and error to get the plot to look how I want, so I just stick with mar.

Write the code for the primary plot. Next, build the main plot. This is the part of the plot with your primary y-axis. Here you can also set the primary x- and y-axis labels. This graph is essentially a histogram using lines. The data set used to create the graph includes counts of the number of cases at each value of VARIABLE 1 for all cases and a subset of cases that meet a given criterion.

with(myData,
plot(VARIABLE1, COUNT1_ALL, type="l",
col="black", cex=1.2, lwd=2,
xlab="VARIABLE 1 VALUE",
ylab="NUMBER OF CASES",
ylim=c(0,10000)))

The use of with simplifies the code by not requiring me to type out the data set name in front of each variable (e.g. type out myData$VARIABLE1 rather than simply VARIABLE1). The only down side to doing this is that you have an extra set of parentheses to keep track of, but if you are using the editor in RStudio, you can click on any parentheses to see its corresponding match in the code, so that’s not a big deal.

The other arguments modify the look of my graph. type = “l” specifies that we’d like a line graph. The argument col sets the plotted line’s color to black, and lwd sets the line width. The way the line width is scaled may vary across platforms, so you may have to do some trial and error fidgeting to get this how you want. The cex argument sets the amount by which plotted text and symbols are magnified relative to default. As with lwd, you may find that you need to go through some iterations of trial and error to get this how you want. The arguments xlab and ylab are self-explanatory; these are the x- and y-axis titles. Finally, ylim sets the y-axis limits. You can also specify breaks, but the default was fine for my purposes here, which put breaks every 2000 units.

NOTE: The arguments bg, cex, col, lty, lwd and pch can be EITHER a parameter OR passed as arguments to plotting functions. It’s not important for you to know that to understand this example, but just be aware.

Write the code for the second line. This piece of code plots a second line using the same plotting space as the first “layer.” Including par(new = T) is what accomplishes this.

par(new = T)
with(myData,
plot(VARIABLE1, COUNT2_SUBSET, col='red3',
ylim=c(0,10000), type='l', lwd=2,
axes=F, xlab=NA, ylab=NA, cex=1.2))

Here, I have disabled the axes and removed the x- and y-axis labels by setting the axes argument to FALSE and xlab / ylab to NA. I’ve also used the col argument to change the line color to red.

Rescale the third variable! The range of values in the third variable I want to plot is WAY smaller than the range for the first two. As the generic variable names suggest, COUNT1_ALL is the total number of cases at a given value of VARIABLE1, while COUNT2_SUBSET is the number of cases meeting some criteria (i.e. a subset of COUNT1_ALL). For the third line, I want it to show the proportion of all cases at each value of VARIABLE1 that meet the criteria to be included in COUNT2_SUBSET. In other words, the values shown by the third line will be COUNT2_SUBSET / COUNT1_ALL.

For the variation in the third line to be visible, this variable will need to be rescaled. The secondary axis will refer to this variable, and its scaling will match up with the primary y-axis scale. I decided that I wanted to stretch the scale of this new variable so that it covers the entire range. In other words, it won’t be scaled to any of the other variables. It will be scaled to use the entire range of the y-axis (max = 10,000). Below is the general formula for rescaling the values for a variable:

For any value of variable x…

new_varX_value = (((old_VarX_value - old_VarX_min) * new_VarX_range) / old_VarX_range) + new_VarX_min

In plain English this means that for any value of your original variable, its rescaled value will be the original value minus the minimum for the original variable multiplied by the range for the new, rescaled value (whatever you decided on). This result is then divided by the range for the original variable. Then you add the desired minimum value for the rescaled variable to that result.

Following that through with a concrete example. I want to get the new, rescaled value for a case for which the original value was 0.14. My original variable had a range of 0 to 0.25, and I would like my rescaled variable to have the same range as my graph limits (0 to 10000). The formula below yields a rescaled value (new_VarX_value) of 5600. This makes sense because that’s 56% of the new range of 10,000, and 0.14 was 56% of the old range of 0.25.

new_varX_value = (((0.14 - 0) * 10000) / 0.25) + 0

There you have it! Now we can go ahead and write the code to plot the third line. In R, because operations are carried out vector-wise by default, you don’t have to do this for EVERY value. You also don’t need to write a loop. You would just switch out the 0.14 with the old variable name. The other values you could either hard code (for simplicity) or pull them directly from the data using max and min functions.

Write the code for the third line. This piece of code is essentially identical to the code we used to plot the second line (COUNT2_SUBSET). The only difference is that we are passing our rescaled variable as the y-scale argument, and changing the line color to blue.

par(new = T)
with(myData,
plot(VARIABLE1, PERCENTAGE_COUNT2, col='blue',
type='l', axes=F, lwd=2,
xlab=NA, ylab=NA, cex=1.2))

Add the secondary y-axis and its label. The last step of building the graph is the addition of the secondary y-axis and the associated label. This is accomplished through the three code snippets below. First, I made percentage labels for use with the secondary y-axis. Next, I use axis to create the new axis and mtext to add the secondary axis title.

# Create labels
new.labs <- paste(seq(0,25,5),"%",sep="")

# Add axis
axis(side = 4,labels=new.labs,
at=seq(0,10000,2000),
col='blue', 
lwd=2, 
col.ticks = 'blue',
col.axis = 'blue')

# Add axis title
mtext(side = 4, line = 3, 'SECONDARY Y-AXIS', col='blue')

The arguments to axis are especially important. The side argument tells where to place the new axis (4 = right side). Then I use the labels argument to use my custom labels on the new axis. The at argument specifies the axis breaks where the labels will be placed. Note that although I have scaled my new variable to match the other two variables, I want the LABELS to show the intended meaning, which is PERCENTAGE. To do this I suppress the default axis labels and insert my own. I’ve also decided to use the col to make the main axis line blue, the col.ticks argument to make the axis ticks blue so they match the line, and I have set the line width for the new axis line to 2 using lwd. The argument col.axis also makes the axis labels blue.

The arguments to mtext are similar to those for axis. I’ve added the secondary axis title. The argument line tells where to place the text along the margin you’ve specified. In this case, the margin is the right side of the graph, so changing the value for line moves the secondary axis label to the left (smaller value) or right (larger value).

Add a legend. If you would like to add a legend, the code below shows how to do that. The arguments you pass to legend change the look of the legend.

legend("topleft",
legend=c("COUNT1_ALL", "COUNT2_SUBSET", "PERCENTAGE_COUNT2"),
y.intersp = 0.8,
x.intersp = 0.5,
lty=c(1,1,1),
lwd=c(2,2,2),
#pch=c(NA, NA, NA),
col=c("black", "red3", "blue"),
bty="n")

There is a lot going on in the legend. The first argument controls the legend’s placement in the figure. The totally confusing legend argument to the legend function lists the names that will be used in the entries. The arguments lty and lwd are lists of the attributes of the entries that will be shown in the legend (linetype 1 for all three entries, line width 2 for all three entries). Similarly, col lists the colors that will be used for each line in the legend. Note the commented out pch argument. You’d include that if you’d also plotted points over the lines but didn’t want to show points in the legend. The legend box was blocking one of my plot lines, so I removed the bounding box from the legend with the argument bty = “n”.

And we’re done! That was a lot of work. This is a good exercise for getting comfortable with R base plots, but I’m going to bet that the new ggplot2 method is much, much easier. Here’s the result:

example_graph

An alternative way to add lines to the original plot. One benefit of using R is that there is more than one way to do things, but this can also be confusing for beginners. Just FYI, an alternative way to add the second and third lines to the plot is to use lines instead of plot. With plot, you need to include the par(new = T) to add a new layer, but with lines you don’t. The code below should be exactly equivalent to what is described above.

# set margins
par(mar = c(5,5,2,5))

# Create the primary plot
with(myData, 
     plot(VARIABLE1, COUNT1_ALL, type="l", 
          col="black", cex=1.2, lwd=2,
          xlab="VALUE OF VARIABLE 1",
          ylab="NUMBER OF CASES",
          ylim=c(0,10000)))

# Add the second line
with(myData, 
     lines(VARIABLE1, COUNT2_SUBSET, col='red3',
          ylim=c(0,10000), lwd=2))

# Remember to rescale the values for the third line first!
# The third line uses the rescaled values

# Add third line
with(myData, 
     lines(VARIABLE1, PERCENTAGE_COUNT2, 
           col='blue', lwd=2))

# Create labels
new.labs <- paste(seq(0,25,5),"%",sep="")

# Add secondary axis
axis(side = 4,labels=new.labs, 
     at=seq(0,10000,2000),
     lwd=2,
     col='blue',
     col.ticks = 'blue',
     col.axis= 'blue')

# Add secondary y-axis title
mtext(side = 4, line = 3, 'SECONDARY Y-AXIS',
      col='blue')

# Add legend
legend("topleft",
       legend=c("COUNT1_ALL", "COUNT2_SUBSET", "PERCENTAGE_COUNT2"),
       y.intersp = 0.8,
       x.intersp = 0.5,
       lty=c(1,1,1),
       lwd=c(2,2,2),
       col=c("black", "red3", "blue"),
       bty="n")
Share
"Like" our Facebook page for more posts about applied statistics, research methods, and coding:

© AcuPsy 2015       info@acupsy.com       Load our Facebook page   Load our Twitter page   Load our LinkedIn page