13 | March | 2010 | Please Scoop Me!

Eytan and I have been actively exploring lots of crazy new ideas in jjplot, a new plotting library for R. Here’s a quick rundown of recent changes. We’d love to hear what you guys think.

Formulae. The old way of expressing the series of geoms and stats that form the plot was cumbersome. Putting a series of commands in the … leads to annoying issues such as poorer error handling. More importantly, because it can only express a series of statements, it becomes unclear which stats affect which geoms, making it impossible to express more complicated combinations.
We believe that formulae are a good solution to this. Layers are separated by ‘+’ operations. Interactions between stats and geoms are expressed via the interaction operator ‘:’. This allows us to gracefully express arbitrary trees of stats and geoms. An example of a jittered scatter plot:

Old:
jjplot(x = x, y = y, data = data, jjplot.jitter(xfactor=1), jjplot.point())
New:
jjplot(y ~ point() : jitter(xfactor = 1) + x, data = data)
The leftmost and rightmost terms correspond to the y and x aesthetics. For a simple case such as this, formulae might not seem like much of an improvement. But consider a more complex example:
jjplot( ~ line(lty="dashed", col = "red") : hist() + bar(width = 0.1) : hist() : jitter(xfactor = 1) + Sepal.Length, data = iris)

Reading from the right, this says to take iris$Sepal.Length, jitter it, bin the data, and bar plot the result. This is cool because it’s immediately clear that you’re stacking stats, plotting a histogram of the jittered data. The first term does the same thing, except that it does a hist() statistic without the jitter, and draws this as a red line.
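To make the "stacking" idea concrete, here is a rough base-R sketch of the same pipeline. It is not how jjplot is implemented: base R's jitter() and hist() are stand-ins I am assuming behave roughly like the jjplot stats of the same name, and the data frame names bars and line are made up for illustration.

```r
# Base-R stand-in for the two layers above; not jjplot internals.
set.seed(1)                                   # make the jitter reproducible
x <- iris$Sepal.Length

# Layer 2: jitter, then bin, then hand the result to a bar geom.
jittered <- jitter(x, factor = 1)             # assumed analogue of jitter(xfactor = 1)
binned_jittered <- hist(jittered, plot = FALSE)
bars <- data.frame(x = binned_jittered$mids, y = binned_jittered$counts)

# Layer 1: bin the raw data (no jitter), then hand the result to a line geom.
binned_raw <- hist(x, plot = FALSE)
line <- data.frame(x = binned_raw$mids, y = binned_raw$counts)
```

Each stat takes a data set and returns a new one, so chaining them with ‘:’ is just function composition read from right to left.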

By using parentheses, you can also apply a stat to multiple stats/geoms.
jjplot( ~ (point(col = "blue", size=3) + line(col = "red", lty="dashed") + bar(width=0.25)) : hist() + Petal.Length, data = iris)
Here we’re just plotting a histogram but with some extra geoms on top for some extra flair.

We think this notation is a simple and elegant way of expressing what interacts with what.
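Part of why formulae work well here is that R hands them over unevaluated, as an expression tree, and ‘:’ binds more tightly than ‘+’. The sketch below only pokes at that tree with base R; it says nothing about how jjplot actually walks it, and the variable names are mine.

```r
# A formula is stored unevaluated, so point() and jitter() need not exist.
f <- y ~ point() : jitter(xfactor = 1) + x

rhs <- f[[3]]                                # right-hand side of the formula
top_op <- as.character(rhs[[1]])             # operator separating layers
layer <- rhs[[2]]                            # point():jitter(xfactor = 1)
layer_op <- as.character(layer[[1]])         # operator joining geom and stat
geom_name <- as.character(layer[[2]][[1]])   # function name on the left of ':'
stat_name <- as.character(layer[[3]][[1]])   # function name on the right of ':'

# Bare variable names in the formula: the y and x aesthetics.
aesthetics <- all.vars(f)
```

Because ‘+’ sits at the top of the tree and ‘:’ below it, layers and their stat chains fall out of the parse without any extra syntax.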
Facets. This way of thinking about facets is somewhat controversial among us. Normally, facets conflate two concepts: how you compute statistics and how you plot them. This means that you compute statistics on facet subsets, then you plot each subset in a separate panel. Well, currently jjplot takes a different tack, treating facets as merely a command to plot different subsets of the data in different panels. To see what this implies, consider
df <- data.frame(state = rownames(state.x77), region = state.region, state.x77)
jjplot(Murder ~ abline(lty = "dashed") : fit() +
       abline() : group(fit(), by = region) +
       point() + Income,
       data = df, color = region, facet = region)

The point() and Income terms simply do a scatter plot. The abline() : group(fit(), by = region) term does lm fits on each subset. Note that you have to be explicit with the grouping. With the old semantics, you’d have an implicit group by on the facet variable, but because we aren’t combining the grouping and the faceting anymore, you have to spell it out. The dashed abline(lty = "dashed") : fit() term shows you the effect of leaving out the grouping operator: you get a single fit over all the data, and it appears on every panel. This is something I’ve always wanted to do, and it also seems to be a persistent question on Stack Overflow (e.g., “how do I draw a line at the facet/global mean on each facet panel?”). Hopefully this formulation makes it obvious.
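If it helps to see the grouped versus ungrouped distinction outside of jjplot, here is a plain base-R sketch. I am assuming fit() amounts to an lm of the y aesthetic on the x aesthetic; the names global_fit and region_fits are just for illustration.

```r
# Same data frame as in the jjplot example.
df <- data.frame(state = rownames(state.x77), region = state.region, state.x77)

# No grouping: one fit over all 50 states, which would be drawn on every panel.
global_fit <- coef(lm(Murder ~ Income, data = df))

# Explicit grouping by region: one fit per subset, one line per panel.
region_fits <- sapply(split(df, df$region),
                      function(d) coef(lm(Murder ~ Income, data = d)))
```

global_fit holds a single intercept and slope, while region_fits has one column of coefficients per region. Faceting alone would never produce the second from the first.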
Sorting. Another persistent question is how to perform sorting on factor scales. Because of the ease of stacking stats in the formula formulation, we think it makes sense to add a few special stats/geoms. One of them is the sort stat. This performs an identity operation on the data frame but also appends some metadata about how to order things, which is then intercepted when the scales are created. Here are some usage examples:
df <- data.frame(name = factor(letters), value = rnorm(26 * 6), type = rep(factor(month.name[1:6]), each = 26))
jjplot(name ~ point() + value, data = df, color = type, facet = type)
The first plot is the data unsorted.

jjplot(name ~ point() : sort(y = value) + value, data = df, color = type, facet = type)
The second plot sorts according to the mean value associated with each factor across all facets (remember, no grouping!). Like reorder, the sort statistic can take a function argument to specify how multiple points should be sorted.

jjplot(name ~ point() : group(sort(y = value), by=type) + value, data = df, color = type, facet = type)
The last plot wraps the sort in a group by, meaning that each facet panel has its own sorting order.
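For comparison, the two orderings can be sketched in base R. This is only an analogy to what the sort stat records, not jjplot code: reorder() with FUN = mean plays the role of the ungrouped sort, and a split by type plays the role of the group wrapper. The seed and the variable names are my own additions.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
df <- data.frame(name = factor(letters),
                 value = rnorm(26 * 6),
                 type = rep(factor(month.name[1:6]), each = 26))

# Ungrouped sort: one ordering of the letters, by mean value across all types.
global_order <- levels(reorder(df$name, df$value, FUN = mean))

# Grouped sort: a separate ordering of the letters within each type.
per_type_order <- lapply(split(df, df$type),
                         function(d) as.character(d$name[order(d$value)]))
```

global_order is a single vector of 26 letters shared by every panel, while per_type_order has one such vector per facet.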