Speaking the Language of Data Visualization: The Grammar of Graphics
This article is the third part of a series on data visualization as essential knowledge for health information (HI) professionals. Part 1 discussed six principles for building effective and accurate data visualizations. Part 2 introduced the neuropsychological concepts and theories behind data visualization that help build a solid foundation for data visualization projects.
Data visualization is the graphic display of quantitative data to help an audience better understand and gain insights from the data. Creating an effective and engaging data visualization is often both an art and a science.
This article discusses the science—the grammar of graphics, a formal framework that speaks to the components and process of data visualization. An understanding of this framework is helpful for anyone creating data visualizations and for their audience.
Like grammar in language, the grammar of graphics defines essential components of a graphic and the rules and steps to prepare and assemble these components into the final presentation of data. Familiarity with the grammar of graphics can help you to dissect a graphic by its components and build data graphics in a versatile and modular way.
The original grammar of graphics was proposed by American statistician and computer scientist Leland Wilkinson in 1999, and the layered grammar of graphics was further developed by New Zealand statistician Hadley Wickham in 2010.
Wilkinson proposed the original grammar of graphics model in his book of the same name. One of his motivations for writing the book was to address a shortcoming in how computer software approached data visualization: everything was driven by the initial choice of chart type. We still see that approach in software like Microsoft Excel when creating charts. If a user needs to plot data using a scatterplot, the software defines every detail of drawing that plot. If the same user decides to plot the same data as a different plot type (e.g., a bar plot), another software function has to define all the details again for the new plot.
Finding this approach inefficient and rigid, Wilkinson borrowed ideas from the object-oriented programming paradigm and described key components of a data graphic. He argued that once these components are defined and created, the final graphic can be assembled from these components using computer software. Using this model, we only need to switch individual components of a graphic, not the entire program, when changes are required in a data plot. Wilkinson specified the following six components in his model:
- Data: The “data” component includes the data to be visualized. It could be a subset of data derived from a larger dataset.
- Trans: “Trans” stands for transformation, the process of transforming the data to be visualized. For example, data may be ranked by counts or average values before being displayed in a bar plot.
- Scale: The “scale” component defines scale transformation. An example is the log transformation of the scales of a Cartesian coordinate system if the data are skewed in a non-linear way.
- Coord: This component defines the frame or coordinate system used for the graphic. The two-dimensional Cartesian coordinate system is commonly used to display values along the X and Y axes. The pie chart is an example of the polar coordinate system, in which values are represented as angles and radii.
- Element: The “element” component includes all the graphic types (e.g., points, bars) and aesthetic attributes (e.g., color, shape). It is where the encoding of quantitative data in different visual elements happens.
- Guide: This is the guiding component of all the graphs. It includes axes, legends, and annotation lines (e.g., regression line). “Guide” provides contextual information for the graphics.
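The idea of swapping a single component can be sketched in a few lines of Python. This is only an illustration of the six components listed above, not Wilkinson's own notation or any real library's API: the `GraphicSpec` class, its field names, and the string values are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]


def identity(rows: List[Row]) -> List[Row]:
    """A transformation that leaves the data unchanged."""
    return rows


@dataclass(frozen=True)
class GraphicSpec:
    """Hypothetical container for the six components of a graphic."""
    data: List[Row]                           # Data: the rows to visualize
    trans: Callable[[List[Row]], List[Row]]   # Trans: data transformation
    scale: str                                # Scale: e.g. "linear" or "log"
    coord: str                                # Coord: e.g. "cartesian" or "polar"
    element: str                              # Element: e.g. "point" or "bar"
    guide: Tuple[str, ...]                    # Guide: axes, legends, annotations


scatter = GraphicSpec(
    data=[{"age": 25, "bmi": 20}, {"age": 45, "bmi": 30}],
    trans=identity,
    scale="linear",
    coord="cartesian",
    element="point",
    guide=("x axis", "y axis"),
)

# Changing the plot type swaps one component; the other five are reused as-is.
bars = replace(scatter, element="bar")
```

Here `bars` shares its data, transformation, scale, coordinate system, and guides with `scatter`; only the element differs, which is the modularity the model is meant to provide.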
Wilkinson’s grammar of graphics was well received and adopted in the research and development of data visualization. For example, the business intelligence software Tableau, which was used to generate the graphics in the example in this article, employed Wilkinson’s model, allowing graphic components to be added independently when creating the data visualization.
Wickham later extended Wilkinson’s model by introducing a layered grammar of graphics, which became the basis of the popular R statistical package ggplot2 (“gg” stands for the grammar of graphics) that he developed to put his model into practice. The layered grammar of graphics specifies the process of data visualization as stacking layers of components for the final product. The components of Wickham’s model are:
- A dataset. This is the dataset used for creating the final data visualization. It could be a subset of a large dataset.
- A set of mappings from variables to aesthetics. Aesthetics are the visual properties that can be perceived: positions, colors, sizes, etc.
- One or more layers. Each layer has one geometric object, one statistical transformation, one position adjustment, and, optionally, one dataset and one set of aesthetic mappings.
  - Geometric objects: bars, lines, dots, etc.
  - Statistical transformations: calculating the statistics (count, average, ranks, etc.) of selected variables to be displayed. For example, counts of instances within each predefined bin of a continuous variable are calculated to create a histogram of the variable.
  - Position adjustment: Because the space for plotting data is limited, the individual elements of a data visualization can sometimes obstruct each other. For example, in a scatterplot of many data points, the individual dots can obscure one another. The positions of these elements may need to be adjusted (e.g., stacked bar chart, jitter plot) so that as many of them as possible remain visible.
- One scale for each aesthetic mapping used. Computers need instructions to convert the original data units into graphical units (pixels, RGB values). Most modern visualization software can perform scaling automatically.
- A coordinate system. This defines how the space for the data visualization (e.g., Cartesian versus polar coordinates) is to be drawn.
- The facet specification. When there are too many variables to plot on a single graphic, we can draw multiple subplots by conditioning the plotting on one or more variables. Using facets, we can create rows and columns of subplots.
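Three of these components (a statistical transformation, a scale, and a facet) can be illustrated with a minimal Python sketch. The function names `stat_bin`, `linear_scale`, and `facet` are hypothetical and chosen only for this example; they are not the ggplot2 API, and real software handles many more cases.

```python
from collections import Counter, defaultdict


def stat_bin(values, bin_width):
    """Statistical transformation: count values per bin, keyed by each bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def linear_scale(value, domain, pixel_range):
    """Scale: map a data value from its domain onto a range of pixel positions."""
    d0, d1 = domain
    r0, r1 = pixel_range
    return r0 + (value - d0) / (d1 - d0) * (r1 - r0)


def facet(rows, by):
    """Facet: split rows into one group per value of the conditioning variable."""
    panels = defaultdict(list)
    for row in rows:
        panels[row[by]].append(row)
    return dict(panels)


# Hypothetical ages, binned in 10-year bins as a histogram would require.
ages = [25, 45, 37, 27, 41, 31, 59, 56]
age_counts = stat_bin(ages, bin_width=10)
# age_counts is {20: 2, 30: 2, 40: 2, 50: 2}

# An age of 40 on an axis covering ages 20 to 60, drawn 400 pixels wide.
x_pixel = linear_scale(40, domain=(20, 60), pixel_range=(0, 400))
# x_pixel is 200.0

# Conditioning on a categorical variable yields one group of rows per subplot.
groups = facet([{"race": "W"}, {"race": "NW"}, {"race": "W"}], by="race")
```

Each function touches only its own concern, so a different bin width, pixel range, or conditioning variable can be substituted without changing the others.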
The two data visualizations below illustrate how the layered grammar of graphics can be used to describe and construct a data visualization. Both are based on the same small fictional dataset in Table 1, which contains six variables (Age, Gender, Race, Body Mass Index [BMI], Hemoglobin A1c [HbA1c], and Highest Educational Level) for eight patients.
Table 1: Fictional Patient Data
Patient | Age | Gender | Race | BMI | Hemoglobin A1c | Highest Educational Level
1 | 25 | F | W | 20 | 4.1 | High School
2 | 45 | M | NW | 30 | 6.1 | College
3 | 37 | F | W | 26 | 4.7 | College
4 | 27 | M | NW | 25 | 3.9 | High School
5 | 41 | M | W | 29 | 7.9 | High School
6 | 31 | M | NW | 26 | 3.9 | College
7 | 59 | F | W | 31 | 6.4 | College
8 | 56 | F | NW | 27 | 7.3 | College
All data are fictional. F = Female, M = Male, W = White, NW = Non-White
Figure 1 is a simple scatterplot that visualizes the relationship between the Age and BMI variables. Figure 2 is more complex, fitting more variables into the two-dimensional scatterplot through additional aesthetic mappings. Faceting was used to split the data visualization by conditioning on the variable Race. Table 2 compares the two data visualizations component by component, following the layered grammar of graphics.
Figure 1: Scatterplot of Age and BMI
Figure 2: Plot of Age, BMI, HbA1c, Gender, Highest Educational Level, and Race
Table 2: Analysis of Data Visualization Using the Layered Grammar of Graphics
Component | Figure 1 | Figure 2
Data | BMI, Age | BMI, Age, Race, Gender, HbA1c, Highest Educational Level
Mapping | Age → X Position; BMI → Y Position | Age → X Position; BMI → Y Position; HbA1c → Size of dot; Gender → Color of dot; Highest Educational Level → Shape of dot
Layer | Geometric → Dot; Statistical Transformation → None; Position Adjustment → None | Geometric → Dot; Statistical Transformation → None; Position Adjustment → None
Scale | X Position, Y Position | X Position, Y Position, Size (radius), Color (orange and dark blue), Shape (circle and square)
Coordinate System | 2-Dimensional Cartesian Coordinate | 2-Dimensional Cartesian Coordinate
Facets | None | Facets are conditioned on the variable Race. A one-by-two subpanel shows the difference in the scatterplots between the White and Non-White racial groups.
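The comparison in Table 2 can also be written down as data. The Python sketch below uses the fictional patient data from Table 1 and describes Figure 1 and Figure 2 as layered specifications. The dictionary keys (`mapping`, `layers`, `coord`, `facet`) and the helper `build_panels` are hypothetical names chosen for illustration; this is not how Tableau or ggplot2 store a graphic internally, and nothing is actually drawn.

```python
# Table 1 from the article (all data fictional).
COLUMNS = ("patient", "age", "gender", "race", "bmi", "hba1c", "education")
ROWS = [
    (1, 25, "F", "W", 20, 4.1, "High School"),
    (2, 45, "M", "NW", 30, 6.1, "College"),
    (3, 37, "F", "W", 26, 4.7, "College"),
    (4, 27, "M", "NW", 25, 3.9, "High School"),
    (5, 41, "M", "W", 29, 7.9, "High School"),
    (6, 31, "M", "NW", 26, 3.9, "College"),
    (7, 59, "F", "W", 31, 6.4, "College"),
    (8, 56, "F", "NW", 27, 7.3, "College"),
]
patients = [dict(zip(COLUMNS, row)) for row in ROWS]

# One layer: dots, no statistical transformation, no position adjustment.
dot_layer = {"geom": "point", "stat": "identity", "position": "identity"}

figure1 = {
    "mapping": {"x": "age", "y": "bmi"},
    "layers": [dot_layer],
    "coord": "cartesian",
    "facet": None,
}

# Figure 2 reuses Figure 1 and only adds aesthetic mappings and a facet.
figure2 = {
    **figure1,
    "mapping": {
        **figure1["mapping"],
        "size": "hba1c",
        "color": "gender",
        "shape": "education",
    },
    "facet": "race",
}


def build_panels(spec, rows):
    """Resolve a spec into panels of marks: one dict of aesthetic values per row."""
    panels = {}
    for row in rows:
        key = row[spec["facet"]] if spec["facet"] else "all"
        mark = {aes: row[var] for aes, var in spec["mapping"].items()}
        panels.setdefault(key, []).append(mark)
    return panels


panels1 = build_panels(figure1, patients)  # a single panel with eight dots
panels2 = build_panels(figure2, patients)  # two panels, "W" and "NW", four dots each
```

Only the mapping and the facet differ between the two specifications, which mirrors the rows of Table 2 where the Layer and Coordinate System entries are identical.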
Using This Framework in Practice
Although the grammar of graphics was proposed and defined to help engineers develop software that assembles and renders data visualizations from components, understanding the framework can also help data visualizers create layered graphics that display multidimensional data. Likewise, if you are part of the audience for a data visualization, familiarity with the grammar of graphics helps you better understand data graphics, especially the complex ones. Health information professionals in practice often encounter multidimensional data with both categorical and continuous data types. A grammar of graphics framework can serve as a blueprint for designing visualizations of such complex data in a layered way.
Xiaoming Zeng (xiaoming_zeng@med.unc.edu) is a research professor in the Department of Psychiatry at the University of North Carolina at Chapel Hill.
Katelyn H. Rouse (hardyka16@ecu.edu) is a clinical assistant professor in the Department of Health Services and Information Management at East Carolina University.