2  Grammar of graphics

This chapter explains the grammar of graphics, which is a powerful model for describing a large class of data visualizations. After reading this chapter, you will be able to

2.1 History and purpose of the grammar of graphics

Most computer systems for data analysis provide functions for creating plots to visualize patterns in data. The oldest systems provide very general functions for drawing basic plot components such as lines and points (e.g. the graphics and grid packages in R). If you use one of these general systems, then it is your job to put the components together to form a meaningful, interpretable plot. The advantage of general systems is that they impose few limitations on what kinds of plots can be created. The disadvantage is that general systems typically do not provide functions for automating common plotting tasks (axes, panels, legends).

To overcome the disadvantages of these general plotting systems, charting packages such as lattice were developed (Sarkar, 2008). Such packages have several pre-defined chart types, and provide a dedicated function for creating each chart type. For example, lattice provides the bwplot function for making box and whisker plots. The advantage of such systems is that they provide a column name specification interface that simplifies creation of entire plots, including legends and panels. Crucially, this column name specification interface allows for rapid experimentation with different plot designs (for example, exchanging the variables used for legends and panels). The disadvantage is the set of pre-defined chart types, which means that it is not easy to create more complex graphics (with several layers of geoms super-imposed on top of each other, each with its own data).

Newer plotting systems based on the grammar of graphics are situated between these two extremes. Wilkinson proposed the grammar of graphics in order to describe and create a large class of plots (Wilkinson, 2005). Wickham later implemented several ideas from the grammar of graphics in the ggplot2 R package (Wickham, 2009). The ggplot2 package has several advantages with respect to previous plotting systems.

  • Like general plotting systems, and unlike lattice, ggplot2 imposes few limitations on the types of plots that can be created (there are no pre-defined chart types). So it is possible to create complex graphics, with different layers plotted on top of each other, each with its own data.
  • Unlike general plotting systems, and like lattice, ggplot2 simplifies creation of legends and panels via a column name specification interface (and so makes it easy to rapidly experiment with using different variables in different plot designs).
  • Since ggplot2 is based on the grammar of graphics, an explicit mapping of data variables to visual properties is required. Later in this chapter, we will explain how this mapping allows sketches of plot ideas to be directly translated into R code.

Finally, all of the previously discussed plotting systems are intended for creating static graphics, which can be viewed equally well on a computer screen or on paper. However, the main topic of this manual is animint2, an R package for interactive graphics. In contrast to static graphics, interactive graphics are best viewed on a computer with a mouse and keyboard that can be used to interact with the plot.

Since many concepts from static graphics are also useful in interactive graphics, the animint2 package is implemented as an extension/fork of ggplot2. In this chapter we will introduce the main features of ggplot2 which will also be useful for interactive plot design in later chapters.

In 2013, we created the animint package, which depends on the ggplot2 package. However during 2014-2017, the ggplot2 package introduced many changes that were incompatible with the interactive grammar of animint. Therefore in 2018 we created the animint2 package which copies/forks the relevant parts of the ggplot2 package. Now animint2 can be used without having ggplot2 installed. In fact, it is recommended to use library(animint2) without attaching ggplot2. However it is fine to use animint2 along with packages that import/load ggplot2. For an example, see Chapter 16, which uses the penaltyLearning package (which imports ggplot2).

2.2 Installing and attaching animint2

To install the most recent release of animint2 from CRAN,

if(!requireNamespace("animint2"))install.packages("animint2")
Loading required namespace: animint2

To install an even more recent development version of animint2 from GitHub,

if(!requireNamespace("animint2")){
  if(!requireNamespace("remotes"))install.packages("remotes")
  remotes::install_github("tdhock/animint2")
}

Once you have installed animint2, you can load and attach all of its exported functions via:

2.3 Translating plot sketches into ggplots

This section explains how to translate a plot sketch into R code. We use a data set from the World Bank as an example. We begin by loading it and examining its column names.

data(WorldBank, package="animint2")
names(WorldBank)
 [1] "iso2c"                       "country"                    
 [3] "year"                        "fertility.rate"             
 [5] "life.expectancy"             "population"                 
 [7] "GDP.per.capita.Current.USD"  "15.to.25.yr.female.literacy"
 [9] "iso3c"                       "region"                     
[11] "capital"                     "longitude"                  
[13] "latitude"                    "income"                     
[15] "lending"                    

We see 15 column names above. The WorldBank data set consist of measures such as fertility rate and life expectancy for each country over the period 1960-2010. To simplify the output below, we compute an abbreviated Region column, and consider only the subset of data which are relevant for the data visualizations below.

WorldBank$Region <- sub(
  " (all income levels)", "", WorldBank$region, fixed=TRUE)
world_bank <- subset(
  WorldBank,
  is.finite(fertility.rate) & is.finite(life.expectancy),
  select=c(Region,country,year,fertility.rate,life.expectancy))
tail(world_bank)
                  Region  country year fertility.rate life.expectancy
13032 Sub-Saharan Africa Zimbabwe 2006          3.551        44.70178
13033 Sub-Saharan Africa Zimbabwe 2007          3.491        45.79707
13034 Sub-Saharan Africa Zimbabwe 2008          3.428        47.07061
13035 Sub-Saharan Africa Zimbabwe 2009          3.360        48.45049
13036 Sub-Saharan Africa Zimbabwe 2010          3.290        49.86088
13037 Sub-Saharan Africa Zimbabwe 2011          3.219        51.23644
dim(world_bank)
[1] 9852    5

The code above prints the last few rows, and the dimension of the data table (9852 rows and 5 columns).

Suppose that we are interested to see if there is any relationship between life expectancy and fertility rate. We could fix one year, then use those two data variables in a scatterplot. Consider the figure below which sketches the main components of that data visualization.

Sketch of scatter plot based on World Bank data

The sketch above shows life expectancy on the horizontal (x) axis, fertility rate on the vertical (y) axis, and a legend for the region. These elements of the sketch can be directly translated into R code using the following method. First, we need to construct a data table that has one row for every country in 1975, and columns named life.expectancy, fertility.rate, and region. The world_bank data already has these columns, so all we need to do is consider the subset for the year 1975:

world_bank_1975 <- subset(world_bank, year==1975)
tail(world_bank_1975)
                          Region      country year fertility.rate
11623        East Asia & Pacific      Vanuatu 1975          5.929
11676        East Asia & Pacific        Samoa 1975          5.237
12524 Middle East & North Africa  Yemen, Rep. 1975          8.089
12683         Sub-Saharan Africa South Africa 1975          5.251
12895         Sub-Saharan Africa       Zambia 1975          7.435
13001         Sub-Saharan Africa     Zimbabwe 1975          7.395
      life.expectancy
11623        55.47998
11676        57.46951
12524        43.40459
12683        54.57920
12895        51.04137
13001        56.71702

The code above prints the data for 1975, which clearly has the appropriate columns, and one row for each country. The next step is to use the notes in the sketch to code a ggplot with a corresponding aes or aesthetic mapping of data variables to visual properties:

scatter <- ggplot()+
  geom_point(
    mapping=aes(x=life.expectancy, y=fertility.rate, color=Region),
    data=world_bank_1975)
scatter

The aes function is called with names for visual properties (x, y, color) and values for the corresponding data variables (life.expectancy, fertility.rate, region). This mapping is applied to the variables in the world_bank_1975 data table, in order to create the visual properties of the geom_point. The ggplot was saved as the scatter object, which when printed on the R command line shows the plot on a graphics device. Note that we automatically have a region color legend.

2.4 Rendering ggplots on web pages using animint

This section explains how the animint2 package can be used to render ggplots on web pages. The ggplot from the previous section can be rendered with animint2, by using the animint function.

animint(scatter)

If, when you run the code above, the animint does not render in your web browser for some reason (for example if you see a blank web page), then please consult our wiki FAQ which will help you find a solution. Internally, the animint function creates a list of class animint, and then R runs the print.animint function via the S3 object system. The animint2 package implements a compiler that takes the list as input, and outputs a web page with a data visualization. The compiler is the animint2dir function, which compiles the animint scatter.viz list to a directory of data and code files that can be rendered in a web browser. It is activated automatically by the print.animint function.

When viewed in a web browser, the animint plot should look mostly the same as static versions produced by standard R graphics devices. One difference is that the region legend is interactive: clicking a legend entry will hide or show the points of that color.

Exercise: try changing the aes mapping of the ggplot, and then making a new animint. Quantitative variables like population are best shown using the x/y axes or point size. Qualitative variables like lending are best shown using point color or fill.

2.5 Multi-layer data visualization (multiple geoms)

Multi-layer data visualization is useful when you want to display several different geoms or data sets in the same plot. For example, consider the following sketch which adds a geom_path to the previous data visualization.

Sketch of multi-layer visualization using World Bank data

Note how the sketch above includes two different geoms (point and path). The two geoms share a common definition of the x, y, and color aesthetics, but have different data sets. Below we translate this sketch into R code.

world_bank_before_1975 <- subset(world_bank, 1970 <= year & year <= 1975)
two.layers <- scatter+
  geom_path(aes(
    x=life.expectancy, 
    y=fertility.rate, 
    color=Region,
    group=country),
    data=world_bank_before_1975)
(viz.two.layers <- animint(two.layers))

Note that we save the return value of the animint function to the viz.two.layers object (which is also printed due to the parentheses). In this manual we will often use variable names that start with viz to denote animint data visualization objects, which are in fact lists of ggplots and options.

The plot above shows a data visualization with 2 geoms/layers:

  • the geom_point shows the life expectancy, fertility rate, and region of all countries in 1975.
  • the geom_path shows the same variables for the previous 5 years.

The addition of the geom_path shows how the countries changed over time. In particular, it shows that most countries moved to the right and down, meaning higher life expectancy and lower fertility rate. However, there are some exceptions. For example, the two East Asian countries in the bottom left suffered a decrease in life expectancy over this period. And there are some countries which showed an increased fertility rate.

Exercise: try changing the region legend to an income legend. Hint: you need to use the same aes(color=income) specification for all geoms, and you will need to use the original WorldBank data with all columns (not world_bank which has a limited number of columns). You may want to use scale_color_manual with a sequential color palette, see RColorBrewer::display.brewer.all(type="seq") and read the appendix for more details.

Can we add the names of the countries to the data viz? Below, we add another layer with a text label for each country’s name.

three.layers <- two.layers+
  geom_text(aes(
    x=life.expectancy, 
    y=fertility.rate, 
    color=Region,
    label=country),
    data=world_bank_1975)
animint(three.layers)

This data viz is not so easy to read, since there are so many overlapping text labels. The interactive region legend helps a little, by allowing the user to hide data from selected regions. However, it would be even better if the user could show and hide the text for individual countries. That type of interaction can be achieved using the showSelected and clickSelects parameters, which we explain in Chapters 3-4.

Exercise: Re-make this data visualization using aes(tooltip), which is a new feature in animint2 (not present in ggplot2), and is discussed in Chapter 5. Set aes(tooltip=country) so that the country name will be visible when you hover the cursor over the corresponding geom.

Next, we move on to discuss a major strength of animint: data visualization with multiple linked plots.

2.6 Multi-plot data visualization

Multi-plot data visualization is useful when you want to show some related data sets using more than one aesthetic mapping. In interactive data visualization, one plot is often used to display a summary, and another plot is used to display details. For example, consider a data visualization with two plots: a time series with World Bank data from 1960-2010 (summary), and a scatterplot with data from 1975 (details). We sketch the time series plot below.

Sketch of visualization with two plots based on World Bank data

Note how the sketch above can be directly translated into the R code below. We copy the existing viz list (viz.two.layers) to a new list (viz.two.plots), then we assign a ggplot to a new element named timeSeries.

viz.two.plots <- viz.two.layers
viz.two.plots$timeSeries <- ggplot()+
  geom_line(aes(
    x=year, 
    y=fertility.rate, 
    color=Region, 
    group=country),
    data=world_bank)

That results in a named list of two elements (both elements are ggplots with class gganimint).

summary(viz.two.plots)
           Length Class     Mode
plot1      9      gganimint list
timeSeries 9      gganimint list

This data visualization list can be printed/rendered by typing its name. Since the list contains two ggplots, animint2 renders the data viz as two linked plots.

viz.two.plots

The data visualization above contains two ggplots, which each map different data variables to the horizontal x axis. The time series uses aes(x=year), and shows a summary of fertility rate values over all years. The scatterplot uses aes(x=life.expectancy), and shows details of the relationship between fertility rate and life expectancy during 1975.

Try clicking a legend entry in either the scatterplot or the time series above. You should see the data and legends in both plots update simultaneously. Since aes(color=Region) was specified in both plots, animint creates a single shared selector variable called Region. Clicking either legend has the effect of updating the set of selected regions, and so animint updates the legends and data in both plots accordingly. This is the main mechanism that animint uses to create interactive data visualizations with linked plots, and will be discussed in more detail in the next two chapters.

Exercise: use animint to create a data viz with three plots, by creating a list with three ggplots. For example, you could add a time series of another data variable such as life.expectancy or population.

Note that both ggplots map the fertility rate variable to the y axis. However, since they are separate plots, the ranges of their y axes are computed separately. That means that even when the two plots are rendered side-by-side, the two y axis are not exactly aligned. That is a problem since it would make it easier to decode the data visualization if each unit of vertical space was used to show the same amount of fertility rate. To achieve that effect, we use facets in the next section.

2.7 Multi-panel data visualization (facets)

Panels or facets are sub-plots that show related data visualizations. One of the main strengths of ggplots is that different kinds of multi-panel plots are relatively easy to create. Multi-panel data visualization is useful for two different purposes:

  • You want to align the axes of several related plots containing different geoms. This facilitates comparison between several different geoms, and is a technique that is also useful for interactive data visualization.
  • You want to divide the data from one geom into several panels. This facilitates comparison between data subsets, and is less useful for interactive data visualization (interactivity can often be used instead, to achieve the same effect of comparing data subsets).

2.7.1 Different geoms in each panel (aligned axes)

We begin by explaining the how facets are useful to align the axes of related plots. Consider the sketch below which contains a plot with two panels.

Sketch of multi-panel visualization of World Bank data using different geoms in each panel (aligned axes)

Note that the two panels plot different geoms using a panel-specific aesthetic mapping. The point and path in the left panel have x=life.expectancy, and the line in the right panel has x=year. Also note that we specified facet=x.var, so we need to add a variable called x.var to each of the three data sets. We translate this sketch to the R code below.

add.x.var <- function(df, x.var){
  data.frame(df, x.var=factor(x.var, c("life expectancy", "year")))
}
(viz.aligned <- animint(
  scatter=ggplot()+
    theme_bw()+
    theme_animint(width=600)+
    theme(panel.margin=grid::unit(0, "lines"))+
    geom_point(aes(
      x=life.expectancy, y=fertility.rate, color=Region),
      data=add.x.var(world_bank_1975, "life expectancy"))+
    geom_path(aes(
      x=life.expectancy, y=fertility.rate, color=Region,
      group=country),
      data=add.x.var(world_bank_before_1975, "life expectancy"))+
    geom_line(aes(
      x=year, y=fertility.rate, color=Region, group=country),
      data=add.x.var(world_bank, "year"))+
    xlab("")+
    facet_grid(. ~ x.var, scales="free")))

The data visualization above contains a single ggplot with two panels and three layers. The left panel shows the geom_point and geom_path, and the right panel shows the geom_line. The panels have a shared axis for fertility rate, which ensures that the lines in the time series panel can be directly compared with the points and paths in the scatterplot panel.

Note that we used the add.x.var function to add a x.var variable to each data set, and then we used that variable in facet_grid(scales="free"). We call this the addColumn then facet idiom, which is generally useful for creating a multi-panel data visualization with aligned axes. In particular, if we wanted to change the order of the panels in the data visualization, we would only need to edit the order of the factor levels in the definition of add.x.var.

Also note that theme_bw means to use black panel borders and white panel backgrounds, and panel.margin=0 means to use no space between panels. Eliminating the space between panels means that more space will be used for the panels, which serves to emphasize the data. We call this the Space saving facets idiom, which is generally useful in any ggplot with facets.

2.7.2 Same geoms in each panel (compare data subsets)

The second reason for using plots with multiple panels in a data visualization is to compare subsets of observations. This facilitates comparison between data subsets, and can be used in at least two different situations:

  • One geom’s data set has too many observations to display informatively in one panel.
  • You want to compare different subsets of data that is plotted for one geom.

For example, consider the sketch below.

Sketch of multi-panel visualization with same geoms in each panel, to compare World Bank data in different years

Note that the three panels plot the same two geoms (point and path). Since facet=show.year, and there are three panels shown, we will need to create data tables which have three values for the show.year variable. The geom_point has data for just 3 years, and the geom_path has data for 15 years (but 3 values of show.year). The code below creates these two data sets for three years of the world_bank data set.

show.point.list <- list()
show.path.list <- list()
for(show.year in c(1975, 1985, 1995)){
  show.point.list[[paste(show.year)]] <- data.frame(
    show.year, subset(world_bank, year==show.year))
  show.path.list[[paste(show.year)]] <- data.frame(
    show.year, subset(
      world_bank, show.year - 5 <= year & year <= show.year))
}
show.point <- do.call(rbind, show.point.list)
show.path <- do.call(rbind, show.path.list)

We used a for loop over three values of show.year, the variable which we will use later in facet_grid. For each value of show.year, we store a data subset as a named element of a list. After the for loop, we use do.call with rbind to combine the data subsets. This is an example of the list of data tables idiom, which is generally useful for interactive data visualization.

Below, we facet on the show.year variable to create a data visualization with three panels.

animint(
  scatter=ggplot()+
    geom_point(aes(
      x=life.expectancy, y=fertility.rate, color=Region),
      data=show.point)+
    geom_path(aes(
      x=life.expectancy, y=fertility.rate, color=Region,
      group=country),
      data=show.path)+
    facet_grid(. ~ show.year)+
    theme_bw()+
    theme_animint(width=600))

The data visualization above contains a single ggplot with three panels. It shows more of the world_bank data set than the previous visualizations which showed only the data from 1975. However, it still only shows a relatively small data subset. You may be tempted to try using a panel to display every year (not just 1975, 1985, and 1995). However, beware that this type of multi-panel data visualization is especially useful if there are only a few data subsets. With more than about 10 panels, it becomes difficult to see all the data at once, and thus difficult to make meaningful comparisons.

Instead of showing all of the data at once, we can instead create an animated data visualization that shows the viewer different data subsets over time. In the next chapter, we will show how the new showSelected keyword can be used to achieve animation, and reveal more details of this data set.

2.8 Chapter summary and exercises

This chapter presented the basics of static data visualization using ggplot2. We showed how animint can be used to render a list of ggplots in a web browser. We explained two features of ggplot2 that make it ideal for data visualization: multi-layer and multi-panel graphics.

Exercises:

  • What are the three main advantages of ggplot2 relative to previous plotting systems such as grid and lattice?

  • What is the purpose of multi-layer graphics?

  • Create a version of viz.two.layers with aes(tooltip) computed based on the min/max values of the data shown by the geom_path. Hint: for each country in world_bank_before_1975, compute a text string to use for aes(tooltip). One way to do this is via data.table(world_bank_before_1975)[, .(tooltip=sprintf(...)), by=country].

  • What are the two different reasons for creating multi-panel graphics? Which of these two types is more useful with interactivity?

  • Let us define “A < B” to mean that “one B can contain several A.” Which of the following statements is true?

    • ggplot < panel
    • panel < ggplot
    • ggplot < animint
    • animint < ggplot
    • layer < panel
    • panel < layer
    • layer < ggplot
    • ggplot < layer
  • In the viz.aligned facets, why is it important to use the scales="free" argument?

  • In viz.aligned we showed a ggplot with a scatterplot panel on the left and a time series panel on the right. Make another version of the data visualization with the time series panel on the left and the scatterplot panel on the right.

  • In viz.aligned the scatterplot displays fertility rate and life expectancy, but the time series displays only fertility rate. Make another version of the data visualization that shows both time series. Hint: use both horizontal and vertical panels in facet_grid.

  • Use aes(size=population) in the scatterplot to show the population of each country. Hint: scale_size_animint(pixel.range=c(5, 10) means that circles with a radius of 5/10 pixels should be used represent the minimum/maximum population.

  • Create a multi-panel data visualization that shows each year of the world_bank data set in a separate panel. What are the limitations of using static graphics to visualize these data?

  • Create viz.aligned using a plotting system that is not based on the grammar of graphics. For example, you can use functions from the graphics package in R (plot, points, lines, etc), or matplotlib in Python. What are some advantages of ggplot2 and animint?

Next, Chapter 3 explains the showSelected keyword, which indicates a variable to use for subsetting the data before plotting.