Having introduced the basic data visualization techniques, we are now ready to discuss the principles and guidelines for building data visualizations.
There are many principles and rules for data visualization in the literature. Some have withstood the test of time, while others are less relevant in the modern age. Every rule has its limitations and its applicable situations, and each situation is different, so it is hard to distill a few universally applicable principles. This does not mean you can do whatever you want; it means you should adapt to the context. Instead of blindly trusting a few principles, use your own instinct and common sense as guidance. For example, if you put yourself in the position of the audience, you should be able to tell where the current visualization needs improvement.
We use the following R packages in this chapter.
library(tidyverse)
library(grid)
library(gridExtra)
library(gapminder)
library(GGally)
library(RColorBrewer)
To visualize a data set, one important task is to choose an appropriate visualization type, for example, a scatter plot or a bar plot. This essentially amounts to selecting the geometry layer in the ggplot() function. The selection can be hard because there are many types of visualization, each designed for a particular purpose and a particular type of data. We will discuss these types in detail in the next few chapters. In this chapter, we mostly focus on the distinctions among them, so that we can select the appropriate type. To select an appropriate visualization, we need to consider the following factors:
For example, when we consider taking the log transformation of our data before visualization, plotting the log-transformed values directly on the axes may be easy and straightforward if the plot is for our own exploration. However, a general audience that is unfamiliar with log transformations cannot easily convert logged values back to the original values, so tick labels in the original units will be much easier to read than tick labels in log units.
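The contrast between the two kinds of tick labels can be sketched as follows. This is a minimal illustration rather than a prescribed recipe: the data frame `df` is simulated and hypothetical, and the break positions are assumptions chosen to match its range.

```r
library(ggplot2)

# Hypothetical right-skewed data, simulated purely for illustration
set.seed(1)
df <- data.frame(x = 10^runif(100, min = 1, max = 5), y = rnorm(100))

# Option A: transform the data; the tick labels are in log10 units (1, 2, 3, ...)
p_logged <- ggplot(df, aes(log10(x), y)) +
  geom_point()

# Option B: transform the axis; the tick labels stay in the original units
tick_breaks <- 10^(1:5)
tick_labels <- format(tick_breaks, big.mark = ",", scientific = FALSE, trim = TRUE)
p_original_units <- ggplot(df, aes(x, y)) +
  geom_point() +
  scale_x_log10(breaks = tick_breaks, labels = tick_labels)
```

Both plots place the points identically; only the labeling differs. Option B is likely the friendlier choice for a general audience, since a reader sees 1,000 rather than 3.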
The table below summarizes various visualization types according to three questions. What is the purpose of the visualization: to display a distribution, to compare distributions, or to show association? What kinds of variables are to be visualized: continuous, categorical, mixed, time series, spatial, etc.? How many variables are there? The table gives you a starting point for finding the right visualization type.
Visualization type | Purpose | Continuous | Discrete | Mixed (con. + dis.) |
---|---|---|---|---|
histogram | distribution | 1 | ||
density plot | distribution | 1 | ||
violin plot | distribution | 1 | ||
boxplot | distribution | 1 | ||
strip plot | distribution | 1 | ||
beeswarm plot | distribution | 1 | ||
sina plot | distribution | 1 | ||
QQ plot | distribution | 1 | ||
CDF plot | distribution | 1 | ||
stacked histogram | distribution, composition | 1+1 | ||
stacked density | distribution, composition | 1+1 | ||
overlapping histogram | distribution, comparison | 1+1 | ||
overlapping density | distribution, comparison | 1+1 | ||
ridgeline plot | distribution, comparison | 1+1 | ||
side-by-side violin | distribution, comparison | 1+1 | ||
side-by-side boxplot | distribution, comparison | 1+1 | ||
side-by-side strip plot | distribution, comparison | 1+1 | ||
side-by-side beeswarm plot | distribution, comparison | 1+1 | ||
side-by-side sina plot | distribution, comparison | 1+1 | ||
overlapping CDF | distribution, comparison | 1+1 | ||
bar plot | composition | 1 | 1+1 w/ geom_col() |
stacked bar plot | composition, comparison, association | 2 | ||
dodged bar plot | composition, comparison, association | 2 | ||
ternary plot | composition | 1 | ||
pie chart | composition | 1 | ||
donut chart | composition | 1 | ||
side-by-side pie charts | composition, comparison | 2 | ||
side-by-side donut charts | composition, comparison | 2 | ||
parallel sets | composition, association | >=2 | ||
mosaic plot | composition, association | >=2 | ||
tree map | composition, association | >=2 hierarchical | ||
heatmap | distribution, association | 2 (wide), 3 (long) | 2 (wide), 3 (long) | 1,2 + 1,2 |
scatter plot | association | 2 | ||
3D scatter plot | association | 3 | ||
colored scatter plot | association | 3 | 2+1 | |
facet scatter plot | association, comparison | >=3 | 2+1,2 | |
bubble plot | association | 3 | 2+1 | |
density contour | association | 2 | ||
2d histogram | association | 2 | ||
hex histogram | association | 2 | ||
correlogram | association, comparison | >=2 | ||
parallel coordinates | association | >=2 | ||
bullet graph | comparison | 5 | ||
Gantt chart | distribution | >=2 | ||
population pyramid | distribution | 2 | ||
surface plot | distribution | 3 | ||
radviz chart | distribution | >=3 | ||
dendrogram | distribution | >=1 | ||
barcode plot | distribution | >=1 | ||
voronoi diagram | composition, distribution | >=2 | ||
Nightingale rose chart | composition | 1+1 | ||
sunburst | composition | 1+1,2,3… | ||
Sankey diagram | composition, association | 1+1,2,3… | ||
Cleveland dot/lollipop plot | composition, distribution | 1+1 | ||
slope graph | association | 1+1 | ||
spider/radar/star plot | display,compare | 1+1 | ||
dumbbell/arrow plot | display,compare | 2+1 | ||
circle packing | composition | 1+1 | ||
line plot | distribution | time series | ||
connected scatter plot | association | time series | ||
dual-axis line plot | distribution | time series | ||
area plot | distribution | time series | ||
stacked area plot | distribution | time series | ||
shaded plot | distribution | time series | ||
stream/river graph | distribution | time series | ||
horizon graph | distribution | time series | ||
calendar-based graphics | distribution | time series | ||
spiral plot | distribution | time series | ||
map | distribution | spatial data | ||
choropleth map | distribution | spatial data | ||
cartogram heatmap | distribution | spatial data | ||
facet map | distribution | spatial data | ||
connection map | distribution | spatial data | ||
arc diagram | distribution | network data | ||
chord diagram | distribution | network data | ||
word cloud | distribution | text data |
Some general recommendations:
Note that there are always exceptions to the rules introduced above, so the selection of the visualization type really depends on the particular situation. It is important to try multiple visualization types before settling on one.
There are many other resources for visualization type categories. The following websites provide many different types of visualizations.
There are also excellent resources on visualization selection principles, for example, The Graphic Continuum and Chart Suggestions — A Thought-Starter
Rome wasn’t built in a day. When constructing a visualization, take a trial-and-error approach: start with something basic, then gradually add features and remove unnecessary components. During this iterative process, let your instinct be your guide.
Among the visualizations mentioned in the previous section, there are some visualization types to be avoided.
Instead of a pie chart or a donut chart, we should use a bar plot to represent the composition of a categorical variable and to show comparisons. The pie chart is not preferred because human vision is less sensitive to (unaligned) angles than to lengths. For example, in a pie chart it is hard to distinguish a small difference in proportion, such as slices of 45 degrees and 46 degrees, even when the two slices are next to each other. A bar plot, on the other hand, easily shows which category is slightly larger because the bars are aligned at the bottom. Below is a visualization of the proportion of colleges in each state among five Midwest states in the USA. The pie chart and donut chart do not give a clear picture of the magnitudes, whereas the bar plot displays the amounts clearly.
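A comparison of this kind can be sketched with the midwest data set that ships with ggplot2. Note the assumption: the college counts used for the figure are not available here, so this sketch uses the share of counties per state as a stand-in. The object names are hypothetical.

```r
library(ggplot2)
library(dplyr)
library(gridExtra)

# Stand-in data: share of counties per state (not the college counts in the figure)
state_share <- midwest %>%
  count(state) %>%
  mutate(prop = n / sum(n))

# Pie chart: a single stacked bar drawn in polar coordinates
pie <- ggplot(state_share, aes(x = "", y = prop, fill = state)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()

# Bar plot: bars share a common baseline, so small differences are visible
bar <- ggplot(state_share, aes(x = reorder(state, -prop), y = prop)) +
  geom_col() +
  labs(x = "State", y = "Proportion of counties")

grid.arrange(pie, bar, ncol = 2)
```

In the pie chart the states with similar shares are hard to rank, while the aligned bars make the ordering immediate.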
3D visualization should generally be avoided since most visualizations are viewed on a screen or on paper. For example, in a 3D scatter plot, the extra dimension usually cannot be represented efficiently (see below). It is hard to identify the exact location of each dot/observation, which creates a visual burden for the analyst. One quick way to fix a 3D visualization is to make it interactive. We have shown a few examples in the chapter where we develop visualizations for high-dimensional data. However, interactive visualizations are often much harder to generate and cannot be printed on paper.
Below is a comparison of a 3D line plot and a 2D line plot. The survival rates are plotted against the logarithm of dose for different drug types. In the 3D visualization, it is hard to determine the values of the survival rate, especially when the ribbons intersect with each other. Alternatively, we can convert the visualization to 2D, which is clearer and easier to understand.
Sometimes, a 3D visualization is even misleading. In the example below, the 3D effect makes the last two years’ numbers look like a big jump, whereas in reality the number increases only slightly.
In addition, the dual-axis line plot should be avoided because readers find it hard to tell which series belongs to which axis.
Lastly, the stacked area plot should be avoided because it is hard to tell whether the areas are stacked or overlapping.
Visualization is not preferred in every situation. For example, if the analyst would like to know the actual values, then tables are preferred over charts. In addition, if the numbers are on very different scales, e.g., tens, millions, and billions, tables can display these values efficiently.
On the other hand, R displays values with 7 significant digits by default, which is often too many. The redundant digits create a burden for the analyst when interpreting the results. As long as the number of digits is enough to distinguish most of the numbers, it is usually sufficient. Useful functions for setting the number of significant digits or for rounding are signif() and round(). The number of significant digits can be set globally with options(digits = 3).
options("digits")
## $digits
## [1] 7
1/3
## [1] 0.3333333
options(digits=3)
1/3
## [1] 0.333
round(1/3,digits = 1)
## [1] 0.3
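The difference between round() and signif() is easiest to see on numbers of different magnitudes. The sketch below uses three of the California rates from the table in this section; the variable names are hypothetical.

```r
# Three rates of different magnitudes, taken from the California table
rates <- c(37.8826320, 0.9767889, 0.0515466)

# round() keeps a fixed number of decimal places
rounded <- round(rates, digits = 1)
rounded
## [1] 37.9  1.0  0.1

# signif() keeps a fixed number of significant digits
significant <- signif(rates, digits = 2)
significant
## [1] 38.000  0.980  0.052
```

With round(), small values lose most of their information (0.0515 becomes 0.1), whereas signif() keeps the same relative precision for every value.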
Here is an example of a table with too many digits, followed by an improved version. Note that numbers are usually easier to compare when they are aligned vertically, i.e., in the same column, than horizontally, because differences in the lengths of the numbers and in their decimal digits are easier to see.
state | year | Measles | Pertussis | Polio |
---|---|---|---|---|
California | 1940 | 37.8826320 | 18.3397861 | 0.8266512 |
California | 1950 | 13.9124205 | 4.7467350 | 1.9742639 |
California | 1960 | 14.1386471 | NA | 0.2640419 |
California | 1970 | 0.9767889 | NA | NA |
California | 1980 | 0.3743467 | 0.0515466 | NA |
state | year | Measles | Pertussis | Polio |
---|---|---|---|---|
California | 1940 | 37.9 | 18.3 | 0.8 |
California | 1950 | 13.9 | 4.7 | 2.0 |
California | 1960 | 14.1 | NA | 0.3 |
California | 1970 | 1.0 | NA | NA |
California | 1980 | 0.4 | 0.1 | NA |
As an example, here are the per 10,000 disease rates, computed from totals and population in R, for California across the five decades:
We are reporting precision up to 0.00001 cases per 10,000, a very small value in the context of the changes that are occurring across the dates. In this case, two significant figures is more than enough and clearly makes the point that rates are decreasing:
The visualization should represent the data faithfully, in the sense that the effect/variability shown in the visualization is proportional to the true effect/variability in the data.
In practice, however, distortion often takes place without being noticed, for various reasons. Conceptually, we can define the lie factor as
\[ \text{Lie Factor}= \frac{\text{size of effect shown in figure}}{\text{size of effect in data}} \]
Ideally, the lie factor should be close to 1, and deviation from 1 in either direction should be avoided. The design of the axes is one of the most common places where the lie factor is blown up.
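As a rough numerical sketch of the definition, the snippet below measures the "size of effect" as the relative change between two values. This is one common convention and an assumption here, since the text does not fix a particular measure. The function names and the numbers are hypothetical.

```r
# Relative change between two values (assumed measure of "size of effect")
effect_size <- function(first, second) {
  abs(second - first) / abs(first)
}

# Lie factor: effect shown in the figure divided by effect in the data
lie_factor <- function(shown_first, shown_second, data_first, data_second) {
  effect_size(shown_first, shown_second) / effect_size(data_first, data_second)
}

# Hypothetical case: the data grow from 100 to 110 (a 10% increase), but the
# bars are drawn from a baseline of 90, so their heights are 10 and 20
# (a 100% increase in bar length)
lf <- lie_factor(shown_first = 10, shown_second = 20,
                 data_first = 100, data_second = 110)
lf
## [1] 10
```

A lie factor of 10 means the figure exaggerates the change tenfold. Drawing the same bars from a baseline of 0 gives heights of 100 and 110 and a lie factor of 1.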
For example, when using bar plots, it is misleading to start the bars at a value other than 0. A bar encodes its value by length, so the lengths are proportional to the data only when the bars start at 0. Encoding by position, in contrast, only requires all positions to lie in the same, aligned coordinate system. If you really do not want to start at 0, use a scatter plot instead of a bar plot.
Another example is using the area of a circle to represent a quantity. Intuitively, the area of the circle should be proportional to the quantity. However, the radius is often made proportional to the quantity instead, which exaggerates the differences and gives the audience a false impression. See the example below.
Fortunately, ggplot2 defaults to using area rather than radius. Additionally, we could use position and length to represent the same data even better.
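# GDP in trillions of US dollars for five countries
gdp <- c(14.6, 5.7, 5.3, 3.3, 2.5)
# Three versions of the same data:
#   "Radius": radius proportional to GDP, so the area grows with GDP^2
#   "Area":   area proportional to GDP
#   "GDP":    the raw values, used for the bar plot
gdp_data <- data.frame(
  Country = rep(c("United States", "China", "Japan", "Germany", "France"), 3),
  y = factor(rep(c("Radius", "Area", "GDP"), each = 5),
             levels = c("Radius", "Area", "GDP")),
  GDP = c(gdp^2 / min(gdp^2), gdp / min(gdp), gdp)
) %>%
  mutate(Country = reorder(Country, GDP))
# Circles for the "Radius" and "Area" encodings
g1 <- gdp_data %>%
  filter(y != "GDP") %>%
  ggplot(aes(Country, y, size = GDP)) +
  geom_point(show.legend = FALSE, color = "blue") +
  scale_size(range = c(1, 5.84) * 4.2) +
  ylab("") + xlab("")
# Bars: the same data encoded by position and length
g2 <- gdp_data %>%
  filter(y == "GDP") %>%
  ggplot(aes(Country, GDP)) +
  geom_bar(stat = "identity", width = 0.5) +
  ylab("GDP in trillions of US dollars")
grid.arrange(g1, g2, ncol = 1)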
This visualization principle is essentially the principle of proportional ink: the representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented.
Different aesthetic dimensions have different expressive power in visualizing data. Human vision is more sensitive to lengths and positions than to areas and colors. Therefore, when choosing the aesthetic mapping, we should use the more sensitive aesthetic dimensions for the most important variables.
For example, in a scatter plot, the two position dimensions are the most important part. Color and size are secondary.
For continuous variables, the aesthetic dimensions are
For discrete variables, the aesthetic dimensions are
Comparison is the basis for scientific discovery. The patterns in a visualization are easier to assess when compared with a reference level. When visualizing the data, providing a reference level helps the audience understand the significance of the pattern. When you make a claim from a data visualization, your statement should always be relative: for example, relative to a baseline level, relative to healthy individuals, or relative to a city. Therefore, we should always compare two patterns instead of visualizing just one.
When comparing two subsets of data, remember to use common axes and common ranges. Also use a vertical layout when the panels share the horizontal axis and a horizontal layout when they share the vertical axis.
Another important point is to always try to show the data. Comparing visualizations or summary statistics is helpful, but the actual data make the comparison more detailed. When comparing densely distributed data points, we can often use jittering or make the data symbols more transparent.
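library(tidyverse)
# Side-by-side panels with free x-axes: hard to compare across panels
g1 <- ggplot(mpg) +
  geom_density(aes(x = hwy)) +
  facet_wrap(~drv, ncol = 3, scales = "free_x")
# Stacked panels, but still with free x-axes
g2 <- ggplot(mpg) +
  geom_density(aes(x = hwy)) +
  facet_wrap(~drv, ncol = 1, scales = "free_x")
# Stacked panels sharing a common x-axis: directly comparable
g3 <- ggplot(mpg) +
  geom_density(aes(x = hwy)) +
  facet_wrap(~drv, ncol = 1)
# Boxplots on a common axis with the jittered raw data on top
g4 <- ggplot(mpg) +
  geom_boxplot(aes(x = hwy, y = drv)) +
  geom_jitter(aes(x = hwy, y = drv), alpha = 0.5)
library(gridExtra)
plots <- list(g1, g2, g3, g4)
# g1 spans the top row; g2, g3, and g4 share the bottom row
layout_mat <- rbind(c(1, 1, 1),
                    c(2, 3, 4))
grid.arrange(grobs = plots,
             layout_matrix = layout_mat, # matrix of layout
             widths = c(1, 1, 1), heights = c(0.75, 1.5))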
When comparing quantities of different scales, it is usually useful to consider the log or square root transformations. In addition, you can incorporate a hierarchy of comparisons, for example, across different continents and then across different years.
library(dslabs)
data("gapminder")
# Raw populations on a linear scale: a few very large countries dominate
g1 <- gapminder %>%
  filter(year == 2010) %>%
  ggplot() +
  geom_jitter(aes(x = continent, y = population), width = 0.2)
# Same data on a log10 axis, with tick labels in the original units
g2 <- gapminder %>%
  filter(year == 2010) %>%
  ggplot() +
  geom_boxplot(aes(x = continent, y = population)) +
  geom_jitter(aes(x = continent, y = population), width = 0.2) +
  scale_y_continuous(trans = "log10", breaks = 10^(5:9),
                     labels = format(10^(5:9), scientific = FALSE, big.mark = ","))
# Hierarchy of comparisons: across continents, then across years
g3 <- gapminder %>%
  filter(year %in% c(1975, 2010)) %>%
  ggplot() +
  geom_boxplot(aes(continent, population, fill = factor(year))) +
  scale_y_continuous(trans = "log10", breaks = 10^(5:9),
                     labels = format(10^(5:9), scientific = FALSE, big.mark = ","))
grid.arrange(g1, g2, g3, ncol = 1)
One important question in data visualization is which variables to plot. To illustrate this issue, we start with a simple example of a scatter plot. How much information can a scatter plot show? Suppose our data has 7 variables, V1 through V7 (simulated from a multivariate normal distribution). Here is a piece of the data set.
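n <- 50  # number of observations
p <- 7   # number of variables
# Covariance between variables i and j is 0.75^|i - j|,
# so all variables are positively correlated
X <- as.data.frame(MASS::mvrnorm(n, rep(0, p), (0.75)^abs(outer(1:p, 1:p, "-"))))
head(X)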
## V1 V2 V3 V4 V5 V6 V7
## 1 0.7237 -0.5555 -0.0357 0.9048 1.149 1.100 1.033
## 2 1.0329 -1.2518 -1.7902 -2.0343 -1.009 -0.454 -0.851
## 3 1.8316 0.0258 1.0542 0.1620 -0.341 0.784 0.120
## 4 -0.6123 -0.6342 -0.1753 0.0174 -0.893 -0.226 -0.932
## 5 0.0789 0.6525 0.4579 0.8590 0.415 -0.358 -0.130
## 6 0.0345 -0.3700 -0.5609 -0.1346 -1.742 -1.978 -1.929
For this data set, we can easily visualize the first two variables V1 and V2 with a scatter plot. The figure looks nice and clear. Each dot represents one data point, with its xy-coordinates representing V1 and V2. To visualize more variables, we can use the size of the symbol to represent V3 and the color intensity to represent V4. Now the figure carries more information, but at the same time it becomes harder to read. If we further add V5 using different types of symbols, we are overwhelmed by the visualization. The figure is so complex that it is hard to recognize the major patterns. We can continue in this fashion by adding more variables, but the figure becomes even harder to read, as we would anticipate. Note that we can visualize all variables with a parallel coordinates plot, where each line represents one observation and its y-coordinates represent the values of all the variables. This is even harder to read, as we cannot even identify the positive correlation among the variables.
# Two variables: position only
p1 <- ggplot(data = X, aes(x = V1, y = V2)) +
  geom_point()
# Three variables: add size for V3
p2 <- ggplot(data = X, aes(x = V1, y = V2, size = V3)) +
  geom_point(show.legend = FALSE) +
  theme(legend.direction = "vertical",
        legend.box = "horizontal")
# Four variables: add color for V4
p3 <- ggplot(data = X, aes(x = V1, y = V2,
                           size = V3, color = V4)) +
  geom_point(show.legend = FALSE) +
  theme(legend.direction = "vertical",
        legend.box = "horizontal")
# Five variables: add shape for V5, cut into four bins
p4 <- ggplot(data = X, aes(x = V1, y = V2,
                           size = V3, color = V4,
                           shape = cut(V5, breaks = 4))) +
  geom_point(show.legend = FALSE) +
  scale_shape_manual(values = c(15, 16, 17, 18)) +
  theme(legend.direction = "vertical",
        legend.box = "horizontal")
# Six variables: add transparency for V6
p5 <- ggplot(data = X, aes(x = V1, y = V2,
                           size = V3, color = V4,
                           shape = cut(V5, breaks = 4),
                           alpha = V6)) +
  geom_point(show.legend = FALSE) +
  scale_shape_manual(values = c(15, 16, 17, 18)) +
  theme(legend.direction = "vertical",
        legend.box = "horizontal")
# All variables at once: parallel coordinates, one line per observation
library(GGally)
p6 <- ggparcoord(X, columns = 1:p, alphaLines = 0.5)
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3)
As we can see, the more information is presented in the figure, the harder it is for readers to recognize the pattern. Ultimately, we only have three dimensions to play with: the (x,y) coordinates of the pixel and its color. So data visualization is essentially the task of compressing data into this three-dimensional space while preserving the most information and the most easily identifiable patterns. In other words, visualization is a dimension reduction game. Here is an illustrative picture that shows the idea1. Similar examples are here2 and here3.
Therefore, we need to be selective on variables for visualization.
Real data is usually multivariate. For any question you pose, there are usually multiple deciding factors. A visualization should aim to show as much of the whole picture as possible instead of focusing on a small part of it. For example, time and location are usually important factors to consider. Conditioning is usually helpful in this case. See the section on Simpson’s paradox for a real example.
Generally, the more dimensions/information of the data we try to present in one figure, the more difficult it is for readers to identify meaningful patterns (or stories) in the data. There is therefore a trade-off between totality and interpretability, as illustrated in the figure below. This is consistent with the limitations of human vision.
Based on my experience, there exists this strange pursuit to add as many details to a single view as possible. In general, let us not overwhelm users with information. Please try to distill for the essentials and give the audience the possibility to dig into the details on their own.
When generating bar plots, the default order of the categories is often alphabetical. However, this is usually not optimal for displaying the data, because it is hard for the audience to see any patterns. Reordering the bars according to their numerical values is preferred. After reordering, the axis displays the categories in descending or ascending order of frequency. Here is an example. We use a bar plot to show the number of colleges in each state in the USA. In the left panel, we adopt the default order. In the middle panel, we reorder the states according to their frequencies, and we immediately see how the states compare to each other. In the right panel, we additionally highlight certain states using different colors.
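The three panels described above can be sketched along the following lines. Since the college data are not available here, this sketch assumes the mpg data set from ggplot2 as a stand-in and counts cars per manufacturer; the object names and the highlighted category are hypothetical.

```r
library(ggplot2)
library(dplyr)

# Stand-in data: number of cars per manufacturer in mpg
counts <- mpg %>% count(manufacturer)

# Default order: alphabetical
p_default <- ggplot(counts, aes(x = manufacturer, y = n)) +
  geom_col() +
  coord_flip()

# Reordered by count using reorder()
p_sorted <- ggplot(counts, aes(x = reorder(manufacturer, n), y = n)) +
  geom_col() +
  coord_flip() +
  xlab("manufacturer")

# Reordered, with one category highlighted by color
p_highlight <- ggplot(counts, aes(x = reorder(manufacturer, n), y = n,
                                  fill = manufacturer == "toyota")) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("FALSE" = "grey60", "TRUE" = "steelblue")) +
  coord_flip() +
  xlab("manufacturer")
```

reorder() turns the category into a factor whose levels are sorted by the supplied numeric variable, which is what drives the order of the bars.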
Colors can be used to visualize different types of data.
Qualitative color scale: for categorical variables, we choose colors that are distinct from each other. These colors should not create the impression of an order, and no color should stand out relative to the others.
Sequential color scale: for numerical variables, we choose colors that indicate which values are larger or smaller than others and how far apart two specific values are.
Diverging color scale: to visualize the deviation of data values in one of two directions relative to a neutral midpoint, we choose colors that make it immediately obvious whether a value is positive or negative and how far it lies in either direction.
In the following, we present these three groups of color scales (sequential, qualitative, and diverging). These color schemes can be found in the R package RColorBrewer.
The ColorBrewer website also hosts important information. https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3
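As a brief sketch, one palette of each kind can be pulled from RColorBrewer as follows. The particular palette names (Set2, Blues, RdBu) and the number of colors are choices made for illustration only.

```r
library(RColorBrewer)

# One palette of each type, five colors each (palette choices are illustrative)
qualitative <- brewer.pal(n = 5, name = "Set2")   # distinct, unordered colors
sequential  <- brewer.pal(n = 5, name = "Blues")  # light to dark
diverging   <- brewer.pal(n = 5, name = "RdBu")   # two hues around a neutral midpoint

# Each palette is a character vector of hex color codes
qualitative

# Show all sequential palettes available in the package
display.brewer.all(type = "seq")
```

In ggplot2, the same palettes can be applied to a discrete color or fill aesthetic with scale_color_brewer() and scale_fill_brewer(), e.g. scale_fill_brewer(palette = "Set2").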
In addition to the RColorBrewer package and the ColorBrewer website, Adobe Color is an excellent website for creating colors. Depending on the purpose, you can create color schemes of different kinds: Analogous, Monochromatic, Triad, Complementary, Split Complementary, Double Split Complementary, Square, Compound, and Shades. Keep in mind that each code, such as #033761, represents a unique color. Here is the link: https://color.adobe.com/create/color-wheel
Do not use more than six colors in one visualization.
In data visualization, we should “let data speak for itself”. Conceptually, we can define the data-ink ratio as
\[ \text{Data Ink Ratio} = \frac{\text{data ink used in the graphic}}{\text{total ink used in the graphic}} \]
Therefore, we essentially want to maximize the data-ink ratio. Below are a few bad examples where the data-ink ratio is low.
Provide extensive text information such as labels, annotations, axes, and numbers. When a large amount of information is presented with some ambiguity or distortion, textual information such as thorough labeling often helps clarify the objective of the figure.
A data graphic should tell a complete story all by itself. You should not have to refer to extra text or descriptions when interpreting a plot, if possible.
When visualizing data, we should always strive for clarity and simplicity. One of the co-creators of R, Ross Ihaka, has explained his principles in visualization as
“Everything should be made as simple as possible, but no simpler.” - Albert Einstein.
We should also focus on showing insights and story.
“The purpose of visualization is insight, not pictures.” - Ben Shneiderman
“Numerical quantities focus on expected values, graphical summaries on unexpected values.” - John Tukey
Depending on the level of the audience, you should adjust your visualization accordingly.
About 7% of the male population and 0.5% of the female population are colorblind.
There are also different types of color blindness. The website https://www.color-blindness.com/ hosts important information about which colors are indistinguishable to some people. We can use the color blindness simulator to understand how colorblind people perceive a visualization. The simulator is available at: https://www.color-blindness.com/coblis-color-blindness-simulator/
Therefore, it is important to keep color blindness in mind when selecting a color palette. For example, the viridis color palettes are great tools for generating colorblind-friendly visualizations.
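A minimal sketch of using the viridis palettes through the scales built into ggplot2 is shown below. The data sets (faithfuld and mpg, both shipped with ggplot2) and the object names are chosen only for illustration.

```r
library(ggplot2)

# Continuous variable: viridis fill scale for a density surface
p_continuous <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
  geom_raster() +
  scale_fill_viridis_c()

# Discrete variable: viridis color scale for a categorical grouping
p_discrete <- ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point() +
  scale_color_viridis_d()
```

The suffix indicates the type of variable: _c for continuous and _d for discrete.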
Exploratory visualization is quick and effective. It may not be carefully designed and is often for internal use, for example by the data analysts themselves. Explanatory visualization, on the other hand, is meant to convince the audience of a particular message and is meant to be consumed by stakeholders and management teams. In the data analysis flowchart, it is important to know that exploratory visualization often occurs inside the loop, whereas explanatory visualization occurs in the communication step.
Throughout these notes, we have mostly focused on static visualizations. They are great choices for printing on paper, in magazines, on posters, etc. They are also easy to understand, and all information is presented to the audience at once. Occasionally, for a more sophisticated audience, an interactive visualization is a better choice. It gives the audience the freedom and flexibility to change the settings and interact with the visualization to get different results. Below is a list of resources for interactive visualization.
Garbage in garbage out. Use visualization to check data quality. If the data has very little signal, then the visualization cannot do much.
Short attention span. People’s attention spans are getting shorter, and readers easily get bored. If you write a paper/report, you should use a data graphic to make the primary point. Imagine that the person you hand the paper/report to has very little time and will only look at the graphic. Is there enough information in that graphic for the person to get the story?
Tufte’s graphical excellence includes
Bad plots: styles
Bad plots: text
Bad plots: information content
Bad plots: axes
Chapter 2 of Tufte (2001) and Wainer (1984) give a few examples of visualizations that violate the principles above. We have also seen many questionable visualizations in practice. Below is a short summary of some of these examples.