Ch6 Visualization Principles

After introducing the basic data visualization techniques, we are now ready to discuss the principles and guidelines for building data visualizations.

Many principles and rules for data visualization are available in the literature. Some have withstood the test of time, while others are less relevant today. Every rule has its own limitations and range of applicability, and every situation is different, so it is hard to distill a few universally applicable principles. This does not mean you can do whatever you want; it means you should adapt to the context. Instead of blindly trusting a few principles, use your own instinct and common sense as guidance. For example, if you put yourself in the position of the audience, you should be able to tell where the current visualization needs improvement.

We use the following R packages in this chapter.

library(tidyverse)
library(grid)
library(gridExtra)
library(gapminder)
library(GGally)
library(RColorBrewer)

1 Choose the Right Visualization

1.1 Select the Right Visualization Type

To visualize a data set, one important task is to choose an appropriate visualization type, for example, a scatter plot or a bar plot. This essentially amounts to selecting the geometry layer in the ggplot() function. The selection can be hard because there are many different types of visualization, each designed for a particular purpose and a particular type of data. We will discuss these types in detail in the next few chapters. In this chapter, we focus mostly on the distinctions among them, so that we can select the appropriate type. To do so, we need to consider the following factors:

Type of data
- Continuous/quantitative
- Discrete/categorical/qualitative
- Spatial data
- Time series data
- Network/relational data
- Unstructured data such as text, video, images
Purpose of the visualization
- Distribution
- Composition
- Association
- Comparison
Level of audience
- Self, i.e., exploratory data visualization.
- Analyst, i.e., explanatory data visualization.
- Executives, i.e., explanatory data visualization and storytelling.

For example, consider applying a log transformation to the data before visualization. If the plot is for our own exploration, it is easy and straightforward to plot the log-transformed data directly on the axes. However, for a general audience that is unfamiliar with the log transformation and cannot easily convert logged values back to the original values, an axis whose tick labels show the original values rather than the log-transformed values is much easier to read.
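The two options can be sketched in ggplot2 as follows. This is only an illustration: it uses the `msleep` data set that ships with ggplot2 (not a data set from this chapter), and the break values are chosen by hand.

```r
library(ggplot2)

# Exploratory version: plot the log-transformed values directly.
# The tick labels are exponents (-2, 0, 2, ...), which the reader must convert back.
p_logged <- ggplot(msleep, aes(x = log10(bodywt), y = sleep_total)) +
  geom_point()

# Audience-facing version: the points sit at the same positions,
# but the tick labels stay in the original unit (kilograms).
p_original_labels <- ggplot(msleep, aes(x = bodywt, y = sleep_total)) +
  geom_point() +
  scale_x_log10(breaks = c(0.01, 1, 100),
                labels = c("0.01", "1", "100")) +
  xlab("Body weight (kg, log scale)")
```

Both plots show the same pattern; only the second lets the audience read body weights without undoing the transformation.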

The table below summarizes various visualization types according to three questions. What is the purpose of the visualization: to display a distribution, to compare distributions, or to show an association? What variables are to be visualized: continuous, categorical, mixed, time series, spatial, etc.? How many variables? The table gives you a starting point for finding the right visualization type.

Visualization type	Purpose	Continuous	Discrete	Mixed (con. + dis.)
histogram	distribution	1
density plot	distribution	1
violin plot	distribution	1
boxplot	distribution	1
strip plot	distribution	1
beeswarm plot	distribution	1
sina plot	distribution	1
QQ plot	distribution	1
CDF plot	distribution	1
stacked histogram	distribution, composition			1+1
stacked density	distribution, composition			1+1
overlapping histogram	distribution, comparison			1+1
overlapping density	distribution, comparison			1+1
ridgeline plot	distribution, comparison			1+1
side-by-side violin	distribution, comparison			1+1
side-by-side boxplot	distribution, comparison			1+1
side-by-side strip plot	distribution, comparison			1+1
side-by-side beeswarm plot	distribution, comparison			1+1
side-by-side sina plot	distribution, comparison			1+1
overlapping CDF	distribution, comparison			1+1
bar plot	composition		1	1+1 w/ `geom_col()`
stacked bar plot	composition, comparison, association		2
dodged bar plot	composition, comparison, association		2
ternary plot	composition		1
pie chart	composition		1
donut chart	composition		1
side-by-side pie charts	composition, comparison		2
side-by-side donut charts	composition, comparison		2
parallel sets	composition, association		>=2
mosaic plot	composition, association		>=2
tree map	composition, association		>=2 hierarchical
heatmap	distribution, association	2 (wide), 3 (long)	2 (wide), 3 (long)	1,2 + 1,2
scatter plot	association	2
3D scatter plot	association	3
colored scatter plot	association	3		2+1
facet scatter plot	association, comparison	>=3		2+1,2
bubble plot	association	3		2+1
density contour	association	2
2d histogram	association	2
hex histogram	association	2
correlogram	association, comparison	>=2
parallel coordinates	association	>=2
bullet graph	comparison	5
Gantt chart	distribution	>=2
population pyramid	distribution	2
surface plot	distribution	3
radviz chart	distribution	>=3
dendrogram	distribution	>=1
barcode plot	distribution	>=1
voronoi diagram	composition, distribution	>=2
Nightingale rose chart	composition			1+1
sunburst	composition			1+1,2,3…
Sankey diagram	composition, association			1+1,2,3…
Cleveland dot/lollipop plot	composition, distribution			1+1
slope graph	association			1+1
spider/radar/star plot	display,compare			1+1
dumbbell/arrow plot	display,compare			2+1
circle packing	composition			1+1
line plot	distribution			time series
connected scatter plot	association			time series
dual-axis line plot	distribution			time series
area plot	distribution			time series
stacked area plot	distribution			time series
shaded plot	distribution			time series
stream/river graph	distribution			time series
horizon graph	distribution			time series
calendar-based graphics	distribution			time series
spiral plot	distribution			time series
map	distribution			spatial data
choropleth map	distribution			spatial data
cartogram heatmap	distribution			spatial data
facet map	distribution			spatial data
connection map	distribution			spatial data
arc diagram	distribution			network data
chord diagram	distribution			network data
word cloud	distribution			text data

Some general recommendations:

The scatter plot is recommended for visualizing the association between two quantitative variables. It is also frequently used to visualize the distribution of these variables.
The bubble plot is recommended for visualizing the association among three quantitative variables.
The line plot is recommended for visualizing one quantitative variable measured over time.
The bar plot is recommended for visualizing the composition (i.e., frequencies) of one categorical variable. It is also used to show the rank and order of the categories.
The stacked bar plot is recommended for comparing the composition of different categorical variables.
The treemap is recommended for visualizing the hierarchical composition of categorical variables.

Note that there are always exceptions to the rules introduced above, so the selection of the visualization type really depends on the particular situation. It is important to try multiple visualization types before finalizing one.
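As a rough starting point, the recommendations above map onto ggplot2 geometries as sketched below. The choice of the `mpg` and `economics` data sets (both shipped with ggplot2) and of the variables is only for illustration.

```r
library(ggplot2)

# Two quantitative variables: scatter plot
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Three quantitative variables: bubble plot (third variable mapped to point size)
p_bubble <- ggplot(mpg, aes(x = displ, y = hwy, size = cty)) +
  geom_point(alpha = 0.4)

# One quantitative variable measured over time: line plot
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# One categorical variable: bar plot of frequencies
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# Two categorical variables: stacked bar plot
p_stacked <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()
```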

There are many other resources for visualization type categories. The following websites provide many different types of visualizations.

There are also excellent resources on visualization selection principles, for example, The Graphic Continuum and Chart Suggestions — A Thought-Starter

Rome wasn’t built in a day. When constructing a visualization, you should take a trial-and-error approach. Start with something basic, then gradually add features and remove unnecessary components. During this iterative process, let your instinct be your guide.

1.2 Visualization Types to be Avoided

Among the visualizations mentioned in the previous section, there are some visualization types to be avoided.

1.2.1 Avoid Pie Chart and Donut Chart

Instead of the pie chart or the donut chart, we should use the bar plot to represent the composition of a categorical variable and to show comparisons. The pie chart is not preferred because human vision is not as sensitive to (unaligned) angles as to lengths. For example, in a pie chart it is hard to distinguish a small difference in proportion, such as slices of 45 degrees and 46 degrees, even when the two slices are next to each other. The bar plot, on the other hand, easily shows which category is slightly larger when the bars are aligned at the bottom. Below is a visualization of the proportion of colleges in each state among five midwest states in the USA. The pie chart and donut chart do not present a clear picture of the magnitudes, whereas the bar plot displays the amounts clearly.
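A comparison of this kind can be sketched as follows. The college data behind the figure is not available here, so the sketch uses the number of counties per state from the `midwest` data set in ggplot2 as a stand-in; the object names are only illustrative.

```r
library(ggplot2)

# Number of counties in each of the five midwest states (stand-in for the college counts)
state_counts <- as.data.frame(table(midwest$state))
names(state_counts) <- c("state", "n")

# Pie chart: a single stacked bar drawn in polar coordinates
p_pie <- ggplot(state_counts, aes(x = "", y = n, fill = state)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL)

# Bar plot: the same counts encoded as aligned lengths, sorted from largest to smallest
p_bar <- ggplot(state_counts, aes(x = reorder(state, -n), y = n)) +
  geom_col() +
  labs(x = "State", y = "Number of counties")
```

In the pie chart, slices of similar size are hard to rank, whereas the aligned bars make the ordering visible at a glance.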

1.2.2 Avoid 3D Visualization

3D visualization should generally be avoided, since most visualizations are viewed on a screen or on paper. For example, in a 3D scatter plot, the extra dimension usually cannot be represented efficiently (see below). It is hard to identify the exact location of each dot/observation, which creates a visual burden for the analyst. One quick way to fix a 3D visualization is to make it interactive. We have shown a few examples in the chapter where we develop visualizations for high-dimensional data. However, interactive versions are often much harder to generate and cannot be printed on paper.

Below is a comparison of a 3D line plot and a 2D line plot. The survival rates are plotted against the logarithm of dose for different drug types. In the 3D visualization, it is hard to determine the values of the survival rate, especially where the ribbons intersect. Alternatively, we can convert the visualization to 2D, which is clearer and easier to understand.

Sometimes a 3D visualization is even misleading. In the example below, the 3D effect makes the numbers for the last two years look like a big jump, whereas in reality the number increases only slightly.

In addition, the dual-axis line plot should be avoided because readers cannot easily tell which series belongs to which axis.

Lastly, the stacked area plot should be avoided because it is hard to know whether the area is stacked or overlapped.

1.3 Tables versus Visualization

Visualization is not preferred in every situation. For example, if the analyst would like to know the actual values, then tables are preferred over charts. In addition, if the numbers are on very different scales, e.g., tens, millions, and billions, then tables can display these values efficiently.

On the other hand, R prints numbers with up to 7 significant digits by default, which is often too many. The redundant digits burden the analyst when interpreting the results. As long as the number of digits is enough to distinguish most of the numbers, it is usually sufficient. Useful functions for setting the number of significant digits or rounding numbers are signif() and round(). The number of significant digits can be set globally with options(digits = 3).

options("digits")

## $digits
## [1] 7

1/3

## [1] 0.3333333

options(digits=3)
1/3

## [1] 0.333

round(1/3,digits = 1)

## [1] 0.3

Here is an example of a table with too many digits and an improved version. Note that numbers are usually easier to compare when they are aligned vertically, i.e., in the same column, than horizontally, because differences in their lengths and decimal digits are easier to see.

As an example, here are the per 10,000 disease rates, computed from totals and population in R, for California across the five decades:

state	year	Measles	Pertussis	Polio
California	1940	37.8826320	18.3397861	0.8266512
California	1950	13.9124205	4.7467350	1.9742639
California	1960	14.1386471	NA	0.2640419
California	1970	0.9767889	NA	NA
California	1980	0.3743467	0.0515466	NA

We are reporting precision up to 0.00001 cases per 10,000, a very small value in the context of the changes that are occurring across the dates. In this case, two significant figures is more than enough and clearly makes the point that rates are decreasing:

state	year	Measles	Pertussis	Polio
California	1940	37.9	18.3	0.8
California	1950	13.9	4.7	2.0
California	1960	14.1	NA	0.3
California	1970	1.0	NA	NA
California	1980	0.4	0.1	NA
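The rounding itself takes one line in R. The sketch below re-enters the unrounded California rates by hand (the data frame name `rates` is only illustrative) and rounds the rate columns to one decimal place, which gives the values of the shorter table.

```r
# California disease rates per 10,000, typed in from the unrounded table
rates <- data.frame(
  state     = "California",
  year      = seq(1940L, 1980L, by = 10L),
  Measles   = c(37.8826320, 13.9124205, 14.1386471, 0.9767889, 0.3743467),
  Pertussis = c(18.3397861, 4.7467350, NA, NA, 0.0515466),
  Polio     = c(0.8266512, 1.9742639, 0.2640419, NA, NA)
)

# Round only the rate columns; state and year are left untouched
rate_cols <- c("Measles", "Pertussis", "Polio")
rates_rounded <- rates
rates_rounded[rate_cols] <- round(rates[rate_cols], digits = 1)

print(rates_rounded)
```

Rounding a copy keeps the full-precision values available for further computation; only the displayed table is shortened.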

2 Faithful Visualization

The visualization should represent the data faithfully, in the sense that the true effect/variability in the data is proportional to the effect/variability shown in the visualization.
However, in practice, distortion often takes place for various reasons without being noticed. Conceptually, we can define the lie factor as

\[ \text{Lie Factor}= \frac{\text{size of effect shown in figure}}{\text{size of effect in data}} \]

Ideally, the lie factor should be close to 1. Deviation from 1 in either direction should be avoided. The design of the axes is one of the most common places where the lie factor is blown up.

For example, when using bar plots, it is misleading to start the bars at a value other than 0, because a bar encodes its value by its length, and a truncated bar no longer has a length proportional to the value. Encoding by position, in contrast, only requires that all positions lie in the same, aligned coordinate system. If you really do not want to start at 0, then use a scatter plot instead of a bar plot.
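To make the lie factor concrete, the sketch below computes it for a bar plot with a truncated axis. The two values and the axis starting point are hypothetical numbers chosen for illustration, and the "size of effect" is taken here to be the relative change from the first bar to the second.

```r
# Relative change from one value to another
relative_change <- function(from, to) (to - from) / from

# Hypothetical values, chosen only for illustration
value_a    <- 50
value_b    <- 55
axis_start <- 45   # bars are drawn from 45 instead of 0

# Effect in the data: 55 is 10% larger than 50
effect_in_data <- relative_change(value_a, value_b)

# Effect in the figure: the drawn bar lengths are 5 and 10, a 100% increase
effect_in_figure <- relative_change(value_a - axis_start, value_b - axis_start)

lie_factor <- effect_in_figure / effect_in_data
lie_factor   # 10: the figure exaggerates the effect tenfold
```

With `axis_start <- 0` the drawn lengths equal the values and the lie factor returns to 1.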

Another example is the use of the area of a circle to represent a quantity. Intuitively, the area of the circle should be proportional to the quantity. However, the radius is often made proportional to the quantity instead, which gives the audience a false impression. See the example below.

Fortunately, ggplot2 defaults to using area rather than radius. Additionally, we could use position and length to represent the same data even better.

# GDP in trillions of US dollars
gdp <- c(14.6, 5.7, 5.3, 3.3, 2.5)
# Three encodings of the same numbers. ggplot2 maps `size` to area, so the
# "Radius" rows use squared values to mimic a radius-proportional encoding.
gdp_data <- data.frame(Country = rep(c("United States", "China", "Japan", "Germany", "France"),3),
           y = factor(rep(c("Radius","Area","GDP"),each=5), levels = c("Radius", "Area", "GDP")),
           GDP= c(gdp^2/min(gdp^2), gdp/min(gdp), gdp)) %>% 
   mutate(Country = reorder(Country, GDP))
# Top panel: circles sized by radius (first row) versus by area (second row)
g1 = gdp_data %>% 
  filter(y != "GDP") %>%
  ggplot(aes(Country, y, size = GDP)) + 
  geom_point(show.legend = FALSE, color = "blue") + 
  scale_size(range = c(1,5.84)*4.2) +
  ylab("") + xlab("")
# Bottom panel: the same values encoded by bar length
g2 = gdp_data %>% 
  filter(y == "GDP") %>%
  ggplot(aes(Country, GDP)) + 
  geom_bar(stat = "identity", width = 0.5) + 
  ylab("GDP in trillions of US dollars")
grid.arrange(g1,g2,ncol=1)
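The size of the distortion from a radius-based encoding can be worked out directly from the GDP numbers used above. The sketch below compares the largest-to-smallest ratio in the data with the ratio of the circle areas the reader actually sees under each encoding; the variable names are only illustrative.

```r
gdp <- c(14.6, 5.7, 5.3, 3.3, 2.5)   # trillions of US dollars, as above

# Largest-to-smallest ratio in the data
ratio_in_data <- max(gdp) / min(gdp)

# Radius proportional to GDP: the visible area grows with GDP squared
area_radius_encoding <- pi * gdp^2
ratio_shown_radius <- max(area_radius_encoding) / min(area_radius_encoding)

# Area proportional to GDP: the visible area preserves the ratio
area_area_encoding <- gdp
ratio_shown_area <- max(area_area_encoding) / min(area_area_encoding)

round(c(data = ratio_in_data,
        radius = ratio_shown_radius,
        area = ratio_shown_area), 2)
# data 5.84, radius 34.11, area 5.84
```

With the radius encoding, the largest circle looks about 34 times as large as the smallest one, although the underlying ratio is below 6.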

This visualization principle is essentially the principle of proportional ink: the representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented.

3 Efficient Visualization

3.1 Use the Aesthetic Dimension Efficiently

Different aesthetic dimensions have different expressive power in visualizing data. Human vision is more sensitive to lengths and positions than to areas and colors. Therefore, when choosing the aesthetic mapping, we should use the more sensitive aesthetic dimensions for the most important variables. For example, in a scatter plot, the two position dimensions carry the most important variables, while color and size are secondary.

For continuous variables, the aesthetic dimensions are

Position on a common scale
Aligned length
Length
Aligned angle
Area
Sequential or gray-scale color
Line width

For discrete variables, the aesthetic dimensions are

Qualitative color
Shape
Length
Aligned angle
Line type

3.2 Show Comparison and Show Data

Comparison is the basis for scientific discovery. The patterns in a visualization are easier to assess when compared with a reference level. When visualizing data, providing a reference level helps the audience understand the significance of the pattern. When you make a claim from a data visualization, your statement should always be relative: for example, relative to a baseline level, relative to healthy individuals, or relative to a city. Therefore, we should always compare two patterns instead of visualizing just one.

When comparing two subsets of data, remember to use common axes and common ranges. Also use a vertical layout when the horizontal axis is shared and a horizontal layout when the vertical axis is shared.

Another important point is to always try to show the data. Comparing visualizations or summary statistics is helpful, but the actual data make the comparison more detailed. When the data points are densely distributed, we can often use jittering or make the data symbols more transparent.

library(tidyverse)
g1 = ggplot(mpg)+
  geom_density(aes(x=hwy))+
  facet_wrap(~drv, ncol = 3, scales="free_x")
g2 = ggplot(mpg)+
  geom_density(aes(x=hwy))+
  facet_wrap(~drv,ncol = 1, scales="free_x")
g3 = ggplot(mpg)+
  geom_density(aes(x=hwy))+
  facet_wrap(~drv,ncol = 1)
g4 = ggplot(mpg)+
  geom_boxplot(aes(x=hwy,y=drv))+
  geom_jitter(aes(x=hwy,y=drv), alpha=0.5)
library(gridExtra)
plots=list(g1,g2,g3, g4)
layout_mat = rbind(c(1, 1, 1),
                   c(2, 3, 4))
grid.arrange(grobs = plots,
             layout_matrix = layout_mat, # matrix of layout
             widths = c(1, 1, 1),heights = c(0.75, 1.5))

When comparing quantities of different scales, it is usually useful to consider the log or square root transformations. In addition, you can incorporate a hierarchy of comparisons, for example, across different continents and then across different years.

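library(dslabs)
# Load the dslabs version of gapminder, which has the `population` column used below
# (the gapminder package attached earlier ships a different data set of the same name)
data("gapminder", package = "dslabs")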
g1 = gapminder %>% 
  filter(year==2010) %>%
  ggplot() +
  geom_jitter(aes(x=continent, y=population),width = 0.2)
g2 = gapminder %>% 
  filter(year==2010) %>%
  ggplot() +
  geom_boxplot(aes(x=continent, y=population)) +
  geom_jitter(aes(x=continent, y=population),width = 0.2) +
  scale_y_continuous(trans = "log10", breaks = 10^(5:9), labels = format(10^(5:9), 
                                                                         scientific = FALSE, 
                                                                         big.mark = ',' ))
g3 = 
  gapminder %>% 
  filter(year %in% c(1975,2010)) %>%
  ggplot() +
  geom_boxplot(aes(continent, population, fill = factor(year))) +
  scale_y_continuous(trans = "log10", breaks = 10^(5:9), labels = format(10^(5:9), 
                                                                         scientific = FALSE, 
                                                                         big.mark = ',' ))
grid.arrange(g1,g2,g3,ncol=1)

3.3 How Much Information can a Figure Show?

One important question in data visualization is which variables to plot. To illustrate this issue, we start with a simple example of a scatter plot. How much information can a scatter plot show? Suppose our data has 7 variables, V1 through V7 (simulated from a multivariate normal distribution). Here is a piece of the data set.

n=50
p=7
X=as.data.frame(MASS::mvrnorm(n, rep(0,p), (0.75)^abs(outer(1:p,1:p,"-"))))
head(X)

##        V1      V2      V3      V4     V5     V6     V7
## 1  0.7237 -0.5555 -0.0357  0.9048  1.149  1.100  1.033
## 2  1.0329 -1.2518 -1.7902 -2.0343 -1.009 -0.454 -0.851
## 3  1.8316  0.0258  1.0542  0.1620 -0.341  0.784  0.120
## 4 -0.6123 -0.6342 -0.1753  0.0174 -0.893 -0.226 -0.932
## 5  0.0789  0.6525  0.4579  0.8590  0.415 -0.358 -0.130
## 6  0.0345 -0.3700 -0.5609 -0.1346 -1.742 -1.978 -1.929

For this data set, we can easily visualize the first two variables V1 and V2 with a scatter plot. The figure looks nice and clear. Each dot represents one data point, with its xy-coordinates representing V1 and V2. To visualize more variables, we can use the size of the symbol to represent V3 and the color intensity to represent V4. Now the figure starts to carry more information, but at the same time it becomes harder to read. If we further add V5 using different types of symbols, we are overwhelmed by the visualization. The figure is so complex that it is hard to recognize the major patterns. We can continue in this fashion by adding more variables; however, as anticipated, the figure becomes even harder to read. Note that we can visualize all variables with a parallel coordinates plot, where each line represents one observation and its y-coordinates represent the values of all variables. However, this is even harder to read, as we cannot even identify the positive correlation among the variables.

p1=ggplot(data = X, aes(x=V1, y=V2)) + 
  geom_point()
p2=ggplot(data = X, aes(x=V1, y=V2, size=V3)) + 
  geom_point(show.legend = FALSE) + 
  theme(legend.direction = "vertical", 
        legend.box = "horizontal")
p3=ggplot(data = X, aes(x=V1, y=V2, 
                        size=V3, color=V4)) + 
  geom_point(show.legend = FALSE) + 
  theme(legend.direction = "vertical", 
        legend.box = "horizontal")
p4=ggplot(data = X, aes(x=V1, y=V2, 
                        size=V3, color=V4,
                        shape=cut(V5,breaks=4))) +
  geom_point(show.legend = FALSE) + 
  scale_shape_manual(values = c(15,16,17,18))+
  theme(legend.direction = "vertical", 
        legend.box = "horizontal")
p5=ggplot(data = X, aes(x=V1, y=V2, 
                        size=V3, color=V4,
                        shape=cut(V5,breaks=4),
                        alpha=V6)) +
  geom_point(show.legend = FALSE) + 
  scale_shape_manual(values = c(15,16,17,18))+
  theme(legend.direction = "vertical", 
        legend.box = "horizontal")
library(GGally)
p6=ggparcoord(X,columns = 1:p, alphaLines = 0.5)
grid.arrange(p1,p2,p3,p4,p5,p6,ncol=3)

As we can see, the more information is presented in the figure, the harder it is for readers to recognize the pattern. Ultimately, we only have three dimensions to play with: the (x,y) coordinates of a pixel and its color. Data visualization therefore essentially compresses the data into this three-dimensional space while preserving as much information as possible and keeping the patterns easy to identify. In other words, visualization is a dimension reduction game. Here is an illustrative picture that shows the idea¹. Similar examples are here² and here³.

Therefore, we need to be selective on variables for visualization.

Real data is usually multivariate. For any question you pose, there are usually multiple deciding factors. A visualization should aim to show as much of the whole picture as possible instead of focusing on a small part of it. For example, time and location are usually important factors to consider. Conditioning is usually helpful in this case. See the section on Simpson’s paradox for a real example.

Generally, the more dimensions/information of the data we try to present in one figure, the more difficult it is for readers to identify meaningful patterns (or stories) in the data. Therefore, there is a totality/interpretability trade-off, as illustrated in the figure below. This is consistent with the limitations of human vision.

Based on my experience, there exists a strange pursuit of adding as many details to a single view as possible. In general, let us not overwhelm users with information. Please try to distill the essentials and give the audience the possibility to dig into the details on their own.

3.4 Create a Meaningful Order of Categories

When generating bar plots, the default order of the categories is often alphabetical. However, this is usually not optimal for displaying the data, because it is hard for the audience to see any patterns. Reordering the bars according to their numerical values is preferred. After reordering, the axis displays the categories in descending or ascending order of frequency. Here is an example. We use the bar plot to show the number of colleges in each state in the USA. In the left panel, we adopt the default order. In the middle panel, we reorder the states according to their frequencies, and we immediately see how these states compare to each other. In the right panel, we additionally highlight certain states using a different color.
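The three panels can be sketched as follows. The college data is not available here, so the sketch counts car models per manufacturer in the `mpg` data set from ggplot2 as a stand-in; the highlighted manufacturer and the colors are arbitrary choices.

```r
library(ggplot2)

# Frequency of each category (stand-in for the number of colleges per state)
counts <- as.data.frame(table(mpg$manufacturer), stringsAsFactors = FALSE)
names(counts) <- c("manufacturer", "n")

# Left panel: default alphabetical order
p_default <- ggplot(counts, aes(x = manufacturer, y = n)) +
  geom_col() +
  coord_flip()

# Middle panel: categories reordered by frequency
counts$manufacturer_ordered <- reorder(counts$manufacturer, counts$n)
p_ordered <- ggplot(counts, aes(x = manufacturer_ordered, y = n)) +
  geom_col() +
  coord_flip()

# Right panel: same order, with one category highlighted
counts$highlight <- counts$manufacturer == "toyota"
p_highlight <- ggplot(counts, aes(x = manufacturer_ordered, y = n, fill = highlight)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("FALSE" = "grey70", "TRUE" = "firebrick")) +
  coord_flip()
```

`reorder()` only changes the factor levels, not the rows of the data frame, so the same data can feed all three panels.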

3.5 Colors

Colors can be used to visualize different types of data.

Qualitative color scale: for categorical variables, we choose colors that are distinct from each other. These colors should not create the impression of an order, and no color should stand out relative to the others.

Sequential color scale: for numerical variables, we choose colors that indicate which values are larger or smaller than others and how far apart two specific values are.

Diverging color scale: to visualize the deviation of data values in one of two directions relative to a neutral midpoint, we choose colors that make it immediately obvious whether a value is positive or negative and how far it deviates in either direction.

In the following, we present these three groups of color scales (sequential, qualitative, and diverging). These color schemes can be found in the R package RColorBrewer.
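As a minimal sketch, one palette of each kind can be extracted with `brewer.pal()`. The palette names ("Set2", "Blues", "RdBu") and the number of colors are choices made here for illustration; `display.brewer.all()` shows the full list.

```r
library(RColorBrewer)

# Qualitative: distinct hues with no implied order
qualitative_cols <- brewer.pal(5, "Set2")

# Sequential: light to dark within one hue, for ordered numerical values
sequential_cols <- brewer.pal(5, "Blues")

# Diverging: two hues that meet at a neutral midpoint
diverging_cols <- brewer.pal(5, "RdBu")

# Each palette is a character vector of hex color codes
print(qualitative_cols)
print(sequential_cols)
print(diverging_cols)
```

In ggplot2, the same palettes are available through scales such as `scale_fill_brewer(palette = "Set2")` for discrete variables and `scale_fill_distiller(palette = "Blues")` for continuous ones.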

The ColorBrewer website also hosts important information. https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3

In addition to the RColorBrewer package and the ColorBrewer website, Adobe Color is an excellent website for creating colors. Depending on the purpose, you can create color combinations of different kinds: Analogous, Monochromatic, Triad, Complementary, Split Complementary, Double Split Complementary, Square, Compound, and Shades. Keep in mind that each code such as #033761 represents a unique color. Here is the link: https://color.adobe.com/create/color-wheel

Do not use more than six colors in one visualization.

3.6 Data-Ink Ratio

In data visualization, we should “let data speak for itself”. Conceptually, we can define the data-ink ratio as

\[ \text{Data Ink Ratio} = \frac{\text{data ink used in the graphic}}{\text{total ink used in the graphic}} \]

Therefore, we should aim to maximize the data-ink ratio. Below are a few bad examples where the data-ink ratio is low.
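In ggplot2, the data-ink ratio is mostly controlled through the theme. The sketch below contrasts a deliberately cluttered bar plot with a stripped-down version of the same plot; the data set (`mpg` from ggplot2) and the specific theme settings are only one possible illustration.

```r
library(ggplot2)

# Low data-ink ratio: dark background, heavy grid lines, and bar outlines
# all use ink without encoding any data
p_low_ratio <- ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue", colour = "black") +
  theme(panel.background = element_rect(fill = "grey60"),
        panel.grid.major = element_line(colour = "black"),
        panel.grid.minor = element_line(colour = "black"))

# Higher data-ink ratio: keep the bars and a light horizontal reference grid,
# drop everything else
p_high_ratio <- ggplot(mpg, aes(x = class)) +
  geom_bar(width = 0.6) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank())
```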

3.7 Providing the Context Through Annotations

Provide extensive textual information such as labels, annotations, axis titles, and numbers. When a large amount of information is presented with some ambiguity or distortion, textual information such as thorough labeling often helps clarify the objective of the figure.

A data graphic should tell a complete story all by itself. You should not have to refer to extra text or descriptions when interpreting a plot, if possible.
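A minimal sketch of a self-explanatory figure in ggplot2 is given below, using `labs()` for the title, axis titles with units, and source, and `annotate()` for a note placed inside the plot. The data set (`mpg` from ggplot2), the wording, and the annotation position are illustrative choices.

```r
library(ggplot2)

p_annotated <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Larger engines tend to have lower highway mileage",
    x = "Engine displacement (litres)",
    y = "Highway mileage (miles per gallon)",
    caption = "Data: mpg data set in ggplot2"
  ) +
  # Point the reader to the group that departs from the overall trend
  annotate("text", x = 6.9, y = 30, hjust = 1,
           label = "Two-seaters: large engines, yet relatively high mileage")
```

The title states the message, the axis titles name the variables and their units, and the annotation explains the one feature a reader is likely to ask about.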

3.8 Less is More

When visualizing data, we should always strive for clarity and simplicity. One of the co-creators of R, Ross Ihaka, has explained his principles in visualization as

If the “story” is simple, keep it simple.
If the “story” is complex, make it look simple.
Tell the truth.

“Everything should be made as simple as possible, but no simpler.” - Albert Einstein.

We should also focus on showing insights and story.

“The purpose of visualization is insight, not pictures.” - Ben Shneiderman

“Numerical quantities focus on expected values, graphical summaries on unexpected values.” - John Tukey

Visualizing patterns
Spotting differences
Show the expected and the unexpected

4 Know Your Audience

Your visualization should be adjusted to the level of the audience.

4.1 Colorblind Friendly Visualization

About 7% of the male population and 0.5% of the female population are colorblind.
There are also different types of colorblindness. The website https://www.color-blindness.com/ hosts important information about which colors are indistinguishable to some people. We can use a colorblindness simulator to understand how colorblind people perceive a visualization. The simulator is available at: https://www.color-blindness.com/coblis-color-blindness-simulator/ When selecting a color palette, it is important to keep this in mind. For example, the viridis color palettes are great tools for generating colorblind-friendly visualizations.
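The viridis palettes are built into ggplot2 as scales, so switching to them usually takes one extra line. The sketch below shows the discrete and the continuous variant; the data sets (`mpg` and `faithfuld` from ggplot2) are chosen only for illustration.

```r
library(ggplot2)

# Discrete variable: viridis colors for the groups
p_viridis_discrete <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  scale_colour_viridis_d()

# Continuous variable: viridis as a sequential fill scale
p_viridis_continuous <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_raster() +
  scale_fill_viridis_c()
```

It is still worth running the finished figure through a colorblindness simulator, since a suitable palette does not guarantee that every pair of adjacent colors is distinguishable.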

4.2 Exploratory and Explanatory Visualization

Exploratory visualization is quick and effective. It may not be carefully designed and is often for internal use, for example by the data analysts themselves. Explanatory visualization, on the other hand, is meant to convince the audience of a particular message and is meant to be consumed by stakeholders and management teams. In the data analysis flowchart, exploratory visualization often occurs inside the analysis loop, whereas explanatory visualization occurs in the communication step.

4.3 Interactive Visualization

Throughout these notes, we have mostly focused on static visualizations. They are great choices for printing on paper, in magazines, on posters, etc. They are also easy to understand, and all information is presented to the audience at once. Occasionally, for a more sophisticated audience, an interactive visualization is a better choice. It gives the audience the freedom and flexibility to change the settings and interact with the visualization to get different results. Below is a list of resources for interactive visualization.

5 Other Tips

Garbage in garbage out. Use visualization to check data quality. If the data has very little signal, then the visualization cannot do much.

Short attention span. Attention spans are getting shorter, and readers get bored easily. If you write a paper/report, you should use a data graphic to make the primary point. Imagine that the person you hand the paper/report to has very little time and will only look at the graphic. Is there enough information in that graphic for the person to get the story?

Tufte’s graphical excellence includes

…is the well designed presentation of interesting data - a matter of substance, of statistics and of design.
…consists of complex data communicated with clarity, precision and efficiency.
…is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
…is nearly always multivariate.
…requires telling the truth about the data.

Bad plots: styles

Not color-blind-friendly (e.g. primarily red and green)
Wrong palette for data type (remember sequential, qualitative and divergent)
Indistinguishable groups (i.e. colors are too similar)
Ugly (high saturation primary colors)

Bad plots: text

Illegible (e.g. too small, poor resolution)
Non-descriptive (e.g. “length” – of what? which units?)
Missing
Inappropriate (e.g. comic sans)

Bad plots: information content

Too much information (TMI)
Too little information (TLI)
No clear message or purpose

Bad plots: axes

Poor aspect ratio
Suppression of the origin
Broken x or y axes
Common, but unaligned scales
Wrong or no transformation

6 Bad Examples

Chapter 2 of Tufte (2001) and Wainer (1984) give examples of visualizations that violate the principles above. We have also seen many questionable visualizations in practice. Below is a short summary of some of these examples.

References

Tufte, Edward R. 2001. The Visual Display of Quantitative Information. 2nd ed. Graphics Press. https://www.amazon.com/Visual-Display-Quantitative-Information/dp/1930824130.

Wainer, Howard. 1984. “How to Display Data Badly.” The American Statistician 38 (2): 137–47. https://doi.org/10.2307/2683253.

Descriptive Analytics and Data Visualization

Yichen Qin (qinyn@ucmail.uc.edu), University of Cincinnati

2024-12-20
