Ch5 Visualization Case Study

In this chapter, we use a few case studies to demonstrate how data visualization techniques help us extract insights from data. The following R packages are needed for running the examples in this chapter.

library(tidyverse)
library(ggthemes)
library(grid)
library(gridExtra)
#library(Lahman)
#library(MASS) # conflict with dplyr package for select(), use dplyr::select
library(RColorBrewer)
library(ggridges)
library(ggrepel)
library(kableExtra)

1 Case Study 1: Simpson’s Paradox in Baseball

We analyze a data set containing the offense and defense statistics for the baseball teams in Major League Baseball in the early years, in particular, from 1871 through 1900. The data set includes baseball teams from many “major” leagues such as American Association, Union Association, Players League, Federal League, and the National Association (of 1871-1875). This data set was created by Sean Lahman, who pioneered the effort to make baseball statistics freely available to the general public. The complete data set includes most recent data is available in the R package Lahman.

We first load the data and print the first a few rows.

Teams = read.csv("data/baseballgame_bf1900.csv")
head(Teams)

##   yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1   1871 <NA>    BS1      BNA    NA    3 31    NA 20 10     NA    NA     N
## 2   1871 <NA>    CH1      CNA    NA    2 28    NA 19  9     NA    NA     N
## 3   1871 <NA>    CL1      CFC    NA    8 29    NA 10 19     NA    NA     N
## 4   1871 <NA>    FW1      KEK    NA    7 19    NA  7 12     NA    NA     N
## 5   1871 <NA>    NY2      NNA    NA    5 33    NA 16 17     NA    NA     N
## 6   1871 <NA>    PH1      PNA    NA    1 28    NA 21  7     NA    NA     Y
##   WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1  <NA> 401 1372 426  70  37  3 60 19 73 16  NA NA 303 109 3.55 22   1  3
## 2  <NA> 302 1196 323  52  21 10 60 22 69 21  NA NA 241  77 2.76 25   0  1
## 3  <NA> 249 1186 328  35  40  7 26 25 18  8  NA NA 341 116 4.11 23   0  0
## 4  <NA> 137  746 178  19   8  2 33  9 16  4  NA NA 243  97 5.17 19   1  0
## 5  <NA> 302 1404 403  43  21  1 33 15 46 15  NA NA 313 121 3.72 32   1  0
## 6  <NA> 376 1281 410  66  27  9 46 23 56 12  NA NA 266 137 4.95 27   0  0
##   IPouts  HA HRA BBA SOA   E DP    FP                    name
## 1    828 367   2  42  23 243 24 0.834    Boston Red Stockings
## 2    753 308   6  28  22 229 16 0.829 Chicago White Stockings
## 3    762 346  13  53  34 234 15 0.818  Cleveland Forest Citys
## 4    507 261   5  21  17 163  8 0.803    Fort Wayne Kekiongas
## 5    879 373   7  42  22 235 14 0.840        New York Mutuals
## 6    747 329   3  53  16 194 13 0.845  Philadelphia Athletics
##                           park attendance BPF PPF teamIDBR teamIDlahman45
## 1          South End Grounds I         NA 103  98      BOS            BS1
## 2      Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3 National Association Grounds         NA  96 100      CLE            CL1
## 4               Hamilton Field         NA 101 107      KEK            FW1
## 5     Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6     Jefferson Street Grounds         NA 102  98      ATH            PH1
##   teamIDretro
## 1         BS1
## 2         CH1
## 3         CL1
## 4         FW1
## 5         NY2
## 6         PH1

Each row in the data set represents various statistics of one team in a particular season. The statistics include the number of runs scored by that team in that year/season (R), the number of home runs (HR), the number of games (G), lost games/losses (L), and wins (W) and many others variables. Suppose we are interested in the relationship between the number of runs (R) and the number of lost games (L). We can easily generate a scatter plot to show the relationship.

Teams %>% 
  ggplot(aes(R, L)) +  
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

Usually, more runs lead to higher scores and hence fewer losses. However, when we visualize the data, we actually get a positive correlation, which is puzzling. We further fit a regression line to the data, which confirms our observations.

In order to understand this puzzle, we recall that the data we have is the statistics of each team in the early years. Back in the 1870s and 1880s, the baseball league is in its infancy. The league has experienced the ups and downs when it developed to its current scale. Therefore, each season is quite different in all aspects from another season. Therefore, when we pool the data across all the years together, the data tells a very different story.

For example, the number of runs and the number of losses depend on how many games each team played during that season. Since the league is growing rapidly, the number of games in each season is much different. Some season hold more games and some hold less.

Therefore, we could reexamine the relationship between the number of runs and the number of losses conditional on the number of games that each team plays in that season. This is sometimes called stratification. Essentially we divide the observations into groups according to the value of a third variable, i.e., the number of games in this case, so that the observations in each group all have similar or identical values for that variable. Within the group, we investigate the data set focusing on the two variables of interest, i.e., the number of runs and losses in this case. Then our results and findings are now conditional on the third variable by which we stratify the data.

Since we want to stratify by the number of games and it is a quantitative variable, we need to cut this variable into several bins/intervals. We can first check the range of the number of games as follow.

range(Teams$G)

## [1]   6 158

It seems an interval of length 20 will give us about 8 strata. We use the function cut() to convert the quantitative variable to a categorical variable with values of (0, 20], … , (140, 160]. To visualize the strata, we use the visualization technique called faceting. It essentially plots the relationship of interest in many small multiple of panels which correspond to the strata. Here, we plot the scatter plot between the numbers of runs and losses within each stratum.

Teams %>%
  mutate(G_strata = cut(G, seq(0,180,20))) %>% 
  ggplot(aes(R, L)) +  
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap( ~ G_strata)

We immediately find out that the relationship becomes negative within each stratum. If we overlay the fitted regression within each stratum on the fitted regression to the whole data set, we can see the contrast.

Teams %>% 
  mutate(G_strata = cut(G, seq(0,180,20))) %>%
  ggplot(aes(R, L, color=G_strata)) +  
  geom_point(alpha = 0.5) + 
  geom_smooth(aes(group=G_strata, color=G_strata),
              method = "lm")+
  scale_color_viridis_d()+
  geom_smooth(color = "blue", method = "lm")

Clearly, the relationship between the numbers of runs and losses within strata and over the entire data set is opposite. This is a typical case of so called Simpson’s paradox. It is called a paradox because we see the sign of the correlation flips when comparing the entire population and specific groups (or strata). It usually happens because there is a confounder variable which affects both of the variables of interest. Since the confounder’s variation is more dominant, the overall pattern between the two variables of interest, when marginalized over the confounder, shows an unexpected association.

In this case, the number of game in each season is the confounder. The more game the team plays, the more runs and more losses the team acquires. When the analysis is stratified over the confounder, the true relationship appears.

We can further investigate this issue. We plot the year against the number of games, the number of runs, and the number of losses. We can immediately see that, as the sport becomes more popular over the years, there are more games in each season, with a few exceptions around 1984-1985. Therefore, the increase of the number of games over the years is the main reason for observing this positive relationship between the number of runs and the number of losses.

plot1 <- Teams %>% 
  ggplot(aes(yearID, G)) + geom_jitter(alpha = 0.5)
plot2 <- Teams %>% 
  ggplot(aes(yearID, R)) + geom_jitter(alpha = 0.5)
plot3 <- Teams %>%
  ggplot(aes(yearID, L)) + geom_jitter(alpha = 0.5)
plot4 <- Teams %>%
  ggplot(aes(G, R)) + geom_jitter(alpha = 0.5)
plot5 <- Teams %>%
  ggplot(aes(G, L)) + geom_jitter(alpha = 0.5)
grid.arrange(plot1, plot2, plot3, plot4, plot5, ncol=3)

This example shows that how faceting in data visualization can help us identify/explain/investigate the Simpson’s paradox in the data set. One important tool for dealing with the Simpson’s paradox is conditionaling, or stratification. Conditional/stratification is achieved in data visualization using the facet feature, where small multiple figures are generated for comparison.

2 Case Study 2: London’s Wind Speed and Direction

Next, we analyze a wind data set containing the hourly measurements of the wind speed (ws) and direction (wd) in London in 1999. The data is available in the R package openair. We first load the data and print the first a few rows.

wind = read.csv(file="data/LondonWind1999.csv", header = TRUE)
head(wind)

##                  date   ws  wd nox no2 o3 pm10      so2       co pm25
## 1 1999-01-01 00:00:00 5.04 140  88  35  4   21 3.835000 1.025000   18
## 2 1999-01-01 01:00:00 4.08 160 132  41  3   17 5.243333 2.700000   11
## 3 1999-01-01 02:00:00 4.80 160 168  40  4   17 6.513333 2.866667    8
## 4 1999-01-01 03:00:00 4.92 150  85  36  3   15 4.180000 1.625000   10
## 5 1999-01-01 04:00:00 4.68 150  93  37  3   16 4.250000 1.025000   11
## 6 1999-01-01 06:00:00 3.84 170 102  32  4   16 4.005000 0.800000    9

As we can see, each row represents one observation of the wind speed and direction at a particular hour in a particular day in 1999. We mostly focus on the wind speed (ws) and direction (wd) and ignore the other variables for now. For the wind direction, we list all possible directions which are equally spaced with 10 degrees aprt from each other.

sort(unique(wind$wd))

##  [1]   0  10  20  30  40  50  60  70  80  90 100 110 120 130 140 150 160 170 180
## [20] 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360

For this data set, we start with the scatter plot of direction versus speed in the upper left panel in the figure below. We can quickly visualize the data, but it does not show many insights because of the overlapping observations. We solve this issue by jittering the observations and immediately see that the wind direction is mostly focus on the 180-270 degree range. Meanwhile, the wind speed is also faster in these directions.

g1=ggplot(wind)+geom_point(aes(x=wd, y=ws),alpha=0.5, size=0.8)
g2=ggplot(wind)+geom_point(aes(x=factor(wd), y=ws),
                           position = position_jitter(0.3),
                           size = 0.8, alpha=0.2)+
  scale_x_discrete(breaks=seq(0,360,30))
grid.arrange(g1,g2,ncol=2)

Even though the scatter plot is sufficient for faithfully presenting the data, but it is hard to obtain any insights. If we focus on the distribution of the wind speed in each direction, we could use side by side boxplot. In the left panel of the figure below, we can see that the wind speed distribution gradually shift to large values as the wind direction moves towards 220 degree. Once improvement we can make about this visualization is that since the wind direction is in a circular format, we can arrange these boxplots in a polar coordinate system as opposed to the Cartesian coordinate system.

g1=ggplot(wind)+geom_boxplot(aes(x=factor(wd), y=ws),outlier.size = 0.5)+
  scale_x_discrete(breaks=seq(0,360,30))
wind$wd[wind$wd==360]=0
g2=ggplot(wind)+geom_boxplot(aes(x=factor(wd), y=ws),outlier.size = 0.5)+coord_polar(start = -pi/32)
grid.arrange(g1,g2,ncol=2)

In the right panel, we can see that boxplots are aligned in a circle. Through the median and first and third quartiles, we can clearly see that the wind speed is much faster in the southwest wind direction. The east wind has relatively fast speed. These insights are harder to extract from the Cartesian coordinate system, which is the advantage of the polar coordinate system.

Using the boxplot in the polar coordinate system is certainly useful, however, there is a shortcoming. For example, we can only visualize the distribution of wind speed in each direction, but cannot account for the frequency of each direction. From the jittered scatter plot, we know that not only does the wind speed is faster in the southwest direction, but there are also more observations of southwest wind than any other directions. Such information cannot be visualized in the figure above.

Alternatively, to visualize the frequency of the wind direction as well as the distribution of the wind speed in each direction, we could use barplot. We first convert the wind speed to different intervals of speed. Then we can use the stacked bar plot to show the frequencies of the wind speed intervals in each direction, shown in the left panel of the figure below. Lastly, we convert the stacked bar plot to the polar coordinate system and show it on the right panel.

wind$binned_speed = cut(wind$ws, c(seq(0,14,2),Inf), include.lowest = TRUE)
wind$direction = 
  cut(wind$wd, 
      c(-1,11.25,33.75,56.25,78.75,101.25,123.75,146.25,168.75,191.25,213.75,236.25,258.75,281.25,303.75,326.25,348.75,361),
      labels = c("N","NNE","NE","ENE","E","ESE","SE","SSE","S","SSW","SW","WSW","W","WNW","NW","NNW","N"))
g3=ggplot(wind, aes(x=direction, fill = binned_speed)) +
  geom_bar(position = position_stack(reverse = TRUE)) +
  scale_fill_viridis_d()+
  theme(legend.position = "top")
g4=ggplot(wind, aes(x=direction, fill = binned_speed)) +
  geom_bar(width=1, position = position_stack(reverse = TRUE)) +
  scale_fill_viridis_d()+
  coord_polar(start = -pi/16)
grid.arrange(g3,g4,ncol=2)

In this visualization, we can see that the west, southwest, south bars take a large portion among all bars, which means the wind mostly comes from these directions in London. In addition, the wind blows fast in these directions because the colors of the bars show more yellow segments This visualization is called wind rose plot. It is powerful because it can visualize the wind speed and direction at the same time. Because of the polar coordinate system, the visualization is very easy to digest and intuitive. There are also many extension of this visualization, for example, the choice of the color scheme.

3 Case Study 3: Basketball Shots

From Charlotte Wickham’s example

Where the Heat and the Thunder Hit Their Shots

shots = read.csv("data/shots_sum.csv", header = TRUE)
head(shots)

##   x  y num_shots num_made prop_made avg_points total_points
## 1 0  4         2        1       0.5        1.5            3
## 2 0  5         3        0       0.0        0.0            0
## 3 0  6         1        0       0.0        0.0            0
## 4 0  7         1        0       0.0        0.0            0
## 5 0  9         2        0       0.0        0.0            0
## 6 0 10         1        0       0.0        0.0            0

shots = filter(shots, num_shots < 1000)
ggplot(shots, aes(x, y))+
  geom_point(aes(color = avg_points, 
                 size = num_shots)) +
  scale_color_distiller("Points", palette = "RdYlGn") +
  scale_size("Attempts", range = c(0.1, 5) ) +
  ylim(0, 35) +
  coord_equal() +
  theme_void() +
  theme(axis.ticks = element_blank(),
        axis.text  = element_blank(),
        axis.line  = element_blank(),
        axis.title = element_blank())

## Warning: Removed 54 rows containing missing values (`geom_point()`).

ggplot(shots, aes(x, y))+
  geom_point(aes(color = avg_points)) +
  scale_color_distiller("Points", palette = "RdYlGn") +
  scale_size("Attempts", range = c(0.1, 5) ) +
  ylim(0, 35) +
  coord_equal() +
  theme_void() +
  theme(axis.ticks = element_blank(),
        axis.text  = element_blank(),
        axis.line  = element_blank(),
        axis.title = element_blank())

## Warning: Removed 54 rows containing missing values (`geom_point()`).

4 Case Study 4:

From Charlotte Wickham’s example

library(dplyr)
library(hflights)
library(ggplot2)

hflights_df <- tbl_df(hflights)

## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

hflights_df <- mutate(hflights_df, 
                      DepHour = floor(DepTime/100),
                      DayOfWeek = factor(DayOfWeek, 
                                         labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
                      Date = ISOdate(Year, Month, DayofMonth)
)
hou <- filter(hflights_df, Origin == "HOU")

hou_mon <- filter(hou, DayOfWeek == "Mon")

# over all mondays in 2011, avg delay of flights departing by hour
hou_mon_avg <- hou_mon %>%
  group_by(DepHour) %>%
  summarise(avg_delay = mean(DepDelay))

# initial plot
ggplot(hou_mon_avg, aes(DepHour, avg_delay)) + 
  geom_point() +
  geom_line() + 
  ylab("Average delay (mins)") +
  xlab("Departure time") +
  scale_x_continuous(breaks = seq(0, 24, 6),
                     labels = c("midnight", "6am", "noon", "6pm", "midnight")) +
  theme_bw(18)

## Warning: Removed 1 rows containing missing values (`geom_point()`).

## Warning: Removed 1 row containing missing values (`geom_line()`).

#ggsave("08-monday.png", width = 6, height = 4)

# for each monday in 2011, avg delay of flights departing by hour
hou_mon_day <- filter(hou, DayOfWeek == "Mon") %>%
  group_by(Date, DepHour) %>%
  summarise(avg_delay = mean(DepDelay))

## `summarise()` has grouped output by 'Date'. You can override using the
## `.groups` argument.

# quantiles for delay by time
hou_mon_q <- hou_mon %>% group_by(DepHour) %>%
  summarise(n = n(),
            q25 = quantile(DepDelay, probs = 0.25, na.rm = TRUE),
            q50 = quantile(DepDelay, probs = 0.5, na.rm = TRUE),
            q75 = quantile(DepDelay, probs = 0.75, na.rm = TRUE))

hou_mon %>%
  ggplot(aes(DepHour, DepDelay)) + 
  geom_point() +
  #geom_line(aes(group = Date)) + 
  ylab("Average delay (mins)") +
  xlab("Departure time") +
  scale_x_continuous(breaks = seq(0, 24, 6),
                     labels = c("midnight", "6am", "noon", "6pm", "midnight")) +
  theme_bw(18)

## Warning: Removed 156 rows containing missing values (`geom_point()`).

5 Case Study 5: Temperature by Cities

library(zoo)

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(tidyverse)
library(lubridate)
t = read_csv("data/temperature_airport.csv")

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 141294 Columns: 61

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (4): STATION, NAME, FMTM, PGTM
## dbl  (45): LATITUDE, LONGITUDE, ELEVATION, ACMH, ACSH, AWND, GAHT, PRCP, PSU...
## lgl  (11): ACMC, ACSC, DAPR, FRGT, MDPR, WT10, WT17, WT19, WV01, WV07, WV20
## date  (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

t2 = t %>%
  mutate(YEAR = year(DATE),
         MONTH = month(DATE),
         DAY = day(DATE),
         MONTHDAY = paste(ifelse(MONTH>=10, MONTH, paste("0",MONTH,sep="")), 
                          ifelse(  DAY>=10,   DAY, paste("0",  DAY,sep="")), sep="-"),
         NAME = case_when(NAME == "CHICAGO OHARE INTERNATIONAL AIRPORT, IL US" ~ "ORD", 
                          NAME == "CINCINNATI NORTHERN KENTUCKY INTERNATIONAL AIRPORT, KY US" ~ "CVG", 
                          NAME == "JFK INTERNATIONAL AIRPORT, NY US" ~ "JFK", 
                          NAME == "MIAMI INTERNATIONAL AIRPORT, FL US" ~ "MIA", 
                          NAME == "SAN FRANCISCO INTERNATIONAL AIRPORT, CA US" ~ "SFO",
                          NAME == "MINNEAPOLIS ST. PAUL INTERNATIONAL AIRPORT, MN US" ~ "MSP")) %>% 
  group_by(NAME, MONTHDAY) %>%
  summarize(TEMP = mean(TAVG, na.rm = TRUE) ) %>%
  ungroup()

## `summarise()` has grouped output by 'NAME'. You can override using the
## `.groups` argument.

t2

## # A tibble: 1,830 × 3
##    NAME  MONTHDAY  TEMP
##    <chr> <chr>    <dbl>
##  1 CVG   01-01     34.2
##  2 CVG   01-02     36.1
##  3 CVG   01-03     35.4
##  4 CVG   01-04     33.4
##  5 CVG   01-05     30  
##  6 CVG   01-06     27.8
##  7 CVG   01-07     25.6
##  8 CVG   01-08     29.8
##  9 CVG   01-09     32.9
## 10 CVG   01-10     32.6
## # ℹ 1,820 more rows

ggplot(t2) +
  geom_line(aes(x = MONTHDAY, y = TEMP, group = NAME, color = NAME)) +
  geom_smooth(aes(x = MONTHDAY, y = TEMP, group = NAME, color = NAME), method = "gam") +
  scale_x_discrete(breaks = c("01-01", "04-01", "07-01", "10-01"),
                   labels = c("01-01", "04-01", "07-01", "10-01")) +
  scale_y_continuous(limits = c(0,100)) +
  coord_polar()

## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

6 Case Study: Gapminder Data Set

This case study is originally developed in Irizarry (2019).

We have made some modifications to the case, but the majority of the analysis follows their book. We will analyze the gapminder data set¹ which is about the statistics of different countries measured over the years.

6.1 Motivation

The data set is created by Hans Rosling², a professor of international health and a pioneer in data visualization. Using this data set and the visualization, he wants to educate the public about the common misunderstanding of the world and different countries. For this initiative, he has created the Gapminder Foundation³ which uses data to show how actual trends in health and economics contradict the narratives that emanate from sensationalist media coverage of catastrophes, tragedies, and other unfortunate events. Gapminder identifies systematic misconceptions about important global trends and proportions and uses reliable data to develop easy to understand teaching materials to rid people of their misconceptions. Its mission is to fight devastating ignorance with a fact-based worldview everyone can understand.

Journalists and lobbyists tell dramatic stories. That’s their job. They tell stories about extraordinary events and unusual people. The piles of dramatic stories pile up in peoples’ minds into an over-dramatic worldview and strong negative stress feelings: “The world is getting worse!”, “It’s we vs. them!”, “Other people are strange!”, “The population just keeps growing!” and “Nobody cares!” Hans Rosling conveyed actual data-based trends in a dramatic way of his own, using effective data visualization. This section is based on two talks that exemplify this approach to education: New Insights on Poverty⁴ and The Best Stats You’ve Ever Seen⁵.

On the Gapminder website, there are a list of questions about the wealth health and economics. You can go through these questions and test your knowledge and understanding. In his TED talk⁶, Hans Rosling tested the audience with questions on the world health such as the child mortality across different countries. For example, for each pair of countries below, which country do you think had the highest child mortality rates in 2015?

Sri Lanka or Turkey
Poland or South Korea
Malaysia or Russia
Pakistan or Vietnam
Thailand or South Africa

When answering these questions without data, the non-European countries are typically picked as having higher child mortality rates: Sri Lanka over Turkey, South Korea over Poland, and Malaysia over Russia. It is also common to assume that countries considered to be part of the developing world: Pakistan, Vietnam, Thailand, and South Africa, have similarly high mortality rates.

To answer these questions, we use data, we can use dplyr. For example, for the first comparison we see that:

gapminder = read.csv("data/gapminder.csv",header = TRUE)
gapminder %>% 
  filter(year == 2015 & country %in% c("Sri Lanka","Turkey")) %>% 
  select(country, infant_mortality)

##     country infant_mortality
## 1 Sri Lanka              8.4
## 2    Turkey             11.6

Turkey has the higher infant mortality rate. We can use this code on all comparisons and find the following:

country	infant mortality	country	infant mortality
Sri Lanka	8.4	Turkey	11.6
Poland	4.5	South Korea	2.9
Malaysia	6.0	Russia	8.2
Pakistan	65.8	Vietnam	17.3
Thailand	10.5	South Africa	33.6

We see that the European countries on this list have higher child mortality rates: Poland has a higher rate than South Korea, and Russia has a higher rate than Malaysia. We also see that Pakistan has a much higher rate than Vietnam, and South Africa has a much higher rate than Thailand. It turns out that when Hans Rosling gave this quiz to educated groups of people, the average score was less than 2.5 out of 5, worse than what they would have obtained had they guessed randomly. This implies that more than ignorant, we are misinformed. In this chapter we see how data visualization helps inform us.

Specifically, in this section, we use data to attempt to answer the following two questions:

Is it a fair characterization of today’s world to say it is divided into western rich nations and the developing world in Africa, Asia, and Latin America?
Has income inequality across countries worsened during the last 40 years?

6.2 Data Import

We start by importing the data to R from the csv file.

gapminder = read.csv("data/gapminder.csv",header = TRUE)
dim(gapminder)

## [1] 10545     9

head(gapminder)

##               country year infant_mortality life_expectancy fertility
## 1             Albania 1960           115.40           62.87      6.19
## 2             Algeria 1960           148.20           47.50      7.65
## 3              Angola 1960           208.00           35.98      7.32
## 4 Antigua and Barbuda 1960               NA           62.97      4.43
## 5           Argentina 1960            59.87           65.39      3.11
## 6             Armenia 1960               NA           66.86      4.55
##   population          gdp continent          region
## 1    1636054           NA    Europe Southern Europe
## 2   11124892  13828152297    Africa Northern Africa
## 3    5270844           NA    Africa   Middle Africa
## 4      54681           NA  Americas       Caribbean
## 5   20619075 108322326649  Americas   South America
## 6    1867396           NA      Asia    Western Asia

As we can see there are 9 variables and 10545 observations. Each row represents one particular country in a particular year. These 9 variables include:

country: name of the country. There are 185 countries in total.
year: the year of observation. It ranges from 1960 to 2016.
infant_mortality: the infant mortality of the country in that year.
life_expectancy: the life expectancy of the country in that year.
fertility: the fertility rate of the country in that year.
population: the population rate of the country in that year.
gdp: the GDP of the country in that year.
continent: the continent the country belongs to. Values include Europe, Africa, Americas, Asia, and Oceania.
region: the region the country belongs to the following values

unique(gapminder$region)

##  [1] "Southern Europe"           "Northern Africa"          
##  [3] "Middle Africa"             "Caribbean"                
##  [5] "South America"             "Western Asia"             
##  [7] "Australia and New Zealand" "Western Europe"           
##  [9] "Southern Asia"             "Eastern Europe"           
## [11] "Central America"           "Western Africa"           
## [13] "Southern Africa"           "South-Eastern Asia"       
## [15] "Eastern Africa"            "Northern America"         
## [17] "Eastern Asia"              "Northern Europe"          
## [19] "Melanesia"                 "Polynesia"                
## [21] "Central Asia"              "Micronesia"

unique(gapminder$year)

##  [1] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974
## [16] 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
## [31] 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
## [46] 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016

6.3 Missing Data

We check to see if there are any missing data and how the missing values distribute in the data.

colSums(is.na(gapminder))

##          country             year infant_mortality  life_expectancy 
##                0                0             1453                0 
##        fertility       population              gdp        continent 
##              187              185             2972                0 
##           region 
##                0

As we can see, there are a lot of missing observations for the infant mortality rate, fertility rate, population, and gdp.

missing_by_year = gapminder %>% 
  group_by(year) %>%
  summarize(num_missing_infant_mortality = sum(is.na(infant_mortality)),
            num_missing_fertility = sum(is.na(fertility)),
            num_missing_population = sum(is.na(population)),
            num_missing_gdp = sum(is.na(gdp)))
missing_year_long = pivot_longer(missing_by_year, cols = 2:5, names_to = "variable", values_to = "missing_num")
ggplot(missing_year_long)+
  geom_bar(aes(year,missing_num),width=0.5, stat = "identity")+ # geom_col()
  facet_wrap(~variable)+
  scale_x_continuous(breaks = seq(1960,2010,10))

On the other hand, we can look at the missing values for each country.

missing_by_coun = gapminder %>% 
  group_by(country) %>%
  summarize(num_missing_infant_mortality = sum(is.na(infant_mortality)),
            num_missing_fertility = sum(is.na(fertility)),
            num_missing_population = sum(is.na(population)),
            num_missing_gdp = sum(is.na(gdp)))
missing_by_coun %>% 
  arrange(desc(num_missing_infant_mortality)) %>%
  select(country,num_missing_infant_mortality) %>%
  head(10)

## # A tibble: 10 × 2
##    country             num_missing_infant_mortality
##    <chr>                                      <int>
##  1 Aruba                                         57
##  2 French Polynesia                              57
##  3 Greenland                                     57
##  4 Hong Kong, China                              57
##  5 Macao, China                                  57
##  6 New Caledonia                                 57
##  7 Puerto Rico                                   57
##  8 Antigua and Barbuda                           31
##  9 Grenada                                       25
## 10 Montenegro                                    25

missing_by_coun %>% 
  arrange(desc(num_missing_fertility)) %>%
  select(country,num_missing_fertility) %>%
  head(10)

## # A tibble: 10 × 2
##    country             num_missing_fertility
##    <chr>                               <int>
##  1 Greenland                               3
##  2 Albania                                 1
##  3 Algeria                                 1
##  4 Angola                                  1
##  5 Antigua and Barbuda                     1
##  6 Argentina                               1
##  7 Armenia                                 1
##  8 Aruba                                   1
##  9 Australia                               1
## 10 Austria                                 1

missing_by_coun %>% 
  arrange(desc(num_missing_population)) %>%
  select(country,num_missing_population) %>%
  head(10)

## # A tibble: 10 × 2
##    country             num_missing_population
##    <chr>                                <int>
##  1 Albania                                  1
##  2 Algeria                                  1
##  3 Angola                                   1
##  4 Antigua and Barbuda                      1
##  5 Argentina                                1
##  6 Armenia                                  1
##  7 Aruba                                    1
##  8 Australia                                1
##  9 Austria                                  1
## 10 Azerbaijan                               1

missing_by_coun %>% 
  arrange(desc(num_missing_gdp)) %>%
  select(country,num_missing_gdp) %>%
  head(10)

## # A tibble: 10 × 2
##    country            num_missing_gdp
##    <chr>                        <int>
##  1 Libya                           46
##  2 Ireland                         45
##  3 Qatar                           45
##  4 West Bank and Gaza              45
##  5 Timor-Leste                     44
##  6 Iraq                            42
##  7 Montenegro                      42
##  8 Aruba                           40
##  9 Estonia                         40
## 10 Maldives                        40

6.4 Divided World

Now we begin to analyze the data set. There is a common understanding about the world that is the world is divided into two groups: the developed countries which are associate with long life spans and small families, versus the developing countries (Africa, Asia, and Latin America) characterized by short life spans and large families. Is this true? To verify this, we propose our first question:

Is the world divided in terms of life expectancy and fertility rate (average number of children per woman)?

To answer this question, we use scatterplot of life expectancy versus fertility rate. We start by looking at data from about 50 years ago, when perhaps this view was first cemented in our minds.

library(RColorBrewer)
gapminder %>%
  filter(year == 1962) %>% # data wrangling
  ggplot(aes(fertility, life_expectancy, color = continent)) + # data visualization
  geom_point() +
  scale_color_brewer(palette = "Paired")

As we can see, most countries fall into two distinct categories: 1. Life expectancy around 70 years and 3 or fewer children per family. 2. Life expectancy lower than 65 years and more than 5 children per family. Therefore, it seems this understanding does not have some support from data and is somewhat true 50 years ago. However, is it still true today? In addition, is the situation different than 50 years ago? To answer this question, we need to regenerate a new figure for 2012 and compare with the previous one.

gapminder %>%
  filter(year %in% c(1962, 2012)) %>%
  ggplot(aes(fertility, life_expectancy, color = continent)) +
  geom_point() +
  facet_wrap(~year) +
  scale_color_brewer(palette = "Paired")

This figure shows the totality of the situation. If we want to focus on different continent, we can further break it down the comparison. As we can see, the most improvement takes place in Africa, Americas, and Asia.

gapminder %>%
  filter(year %in% c(1962, 2012)) %>%
  ggplot(aes(fertility, life_expectancy, color = continent)) +
  geom_point() +
  facet_grid(year~continent) +
  scale_color_brewer(palette = "Paired") +
  theme(legend.position = "top")

This plot clearly shows that the majority of countries have moved from the developing world cluster to the western world one. In conclusion, the belief that the world is divided into two groups are no longer support nowadays. This is particularly true for Asia.

The figure focus on the overall improvement of the entire world or each continent. What is we want to focus on each country? We can plot the trajectory of each country.

gapminder %>%
  filter(year %in% c(1962, 2012)) %>%
  group_by(country) %>%
  summarise(start_fert = fertility[year==1962],
            end_fert = fertility[year==2012],
            start_life = life_expectancy[year==1962],
            end_life = life_expectancy[year==2012],
            continent = continent) %>%
  ggplot() +
  geom_segment(aes(x=start_fert,xend=end_fert,
                   y=start_life,yend=end_life,
                   color=continent,
                   group=country)) +
  facet_wrap(~continent) +
  scale_color_brewer(palette = "Paired")

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.

As we can see, Asian and Americas countries collectively improved significantly. Part of African countries improved significantly in both life expectancy and fertility, but the others have only moderate improvements and these improvement are mostly life expectancy. Most of the European countries have mild improvements.

We are looking at the span of 50 years, it make sense to look at the data most closely and visualize how the changes took place through the years. We can add 1970, 1980, 1990, and 2000. If we do this, we will not want all the plots on the same row, the default behavior of facet_grid, since they will become too thin to show the data. Instead, we will want to use multiple rows and columns. The function facet_wrap permits us to do this by automatically wrapping the series of plots so that each display has viewable dimensions:

gapminder %>% 
  filter(year %in% c(1962, 1980, 1990, 2000, 2012)) %>%
  ggplot( aes(fertility, life_expectancy, color = continent)) +
  geom_point() +
  facet_wrap(~year, ncol=5) +
  scale_color_brewer(palette = "Paired") +
  theme(legend.position = "top")

gapminder %>% 
  filter(year %in% seq(1962, 2012, 1)) %>%
  ggplot() +
  geom_path(aes(fertility, life_expectancy, 
                color=continent, group = country)) +
  #geom_text_repel(aes(fertility, life_expectancy, label=year)) +
  facet_wrap(~continent) +
  scale_color_brewer(palette = "Paired")

This plot clearly shows how most Asian countries have improved at a much faster rate than European ones. Meanwhile, an interesting pattern is that most countries improve the life expectancy first and then reduce the fertility rate. This makes sense because the increase of life expectancy implies the improvement of health care quality. Once the health care condition is improved, the family does not need give birth to many children.

The default choice of the range of the axes is important. When not using facet, this range is determined by the data shown in the plot. When using facet, this range is determined by the data shown in all plots and therefore kept fixed across plots. This makes comparisons across plots much easier. For example, in the above plot, we can see that life expectancy has increased and the fertility has decreased across most countries. We see this because the cloud of points moves. This is not the case if we adjust the scales: In the plot above, we have to pay special attention to the range to notice that the plot on the right has a larger life expectancy.

filter(gapminder, year%in%c(1962, 2012)) %>%
  ggplot(aes(fertility, life_expectancy, col = continent)) +
  geom_point() +
  facet_wrap(. ~ year, scales = "free")

We can repeat the same analysis using another pair of variables.

gapminder %>%
  filter(year %in% c(1962, 2011)) %>% # data wrangling
  mutate(income = gdp/population) %>%
  ggplot(aes(income, life_expectancy, color = continent)) + # data visualization
  scale_x_log10()+
  geom_point() +
  facet_wrap(~year) +
  scale_color_brewer(palette = "Paired")

## Warning: Removed 106 rows containing missing values (`geom_point()`).

6.5 Income Inequality

Another commonly held belief is that the wealth inequality has became more serious over the years with the rich getting richer and poor getting poorer. We now shift our attention to the second question related to the commonly held notion that wealth distribution across the world has become worse during the last decades.

Did the income inequality in the world become worse or better?

When general audiences are asked if poor countries have become poorer and rich countries become richer, the majority answers yes. By using stratification, histograms, smooth densities, and boxplots, we will be able to understand if this is in fact the case. First we learn how transformations can sometimes help provide more informative summaries and plots.

We use the GDP per capita as a measure of individual wealth and create this variable in the data set. Note that income is adjusted for inflation and represent current US dollars, so it is comparable across the years. In addition, this variable represents the average income of that country. There are certainly variation in income within that country, which is not what we are studing right now because we do not have assess to this data.

gapminder <- gapminder %>%  
  mutate(dollars_per_day = gdp/population/365,
         gdp_per_capita = gdp/population)

Here is a histogram of GDP per capita from 1970.

past_year <- 1970
# gapminder %>% 
#   filter(year == past_year & !is.na(gdp_per_capita)) %>%
#   ggplot(aes(dollars_per_day)) + 
#   geom_histogram(binwidth = 1, color = "black")
gapminder %>% 
  filter(year == past_year & !is.na(gdp_per_capita)) %>%
  ggplot(aes(gdp_per_capita)) + 
  geom_histogram(binwidth = 1000)

Because of the skeness of the data, the visualization mainly displays the countries with gdp per capita above $4000. So we are unclear about the distribution of the gdp per capita for the rest of the countries, even though they are the majority in the data.

To address this issue, we consider the log transformation which convert the scale into a multiplicative order.

In addition to log base 10 transformation. Other common choices are base $\mathrm{e}$ (the natural log) and base 2.

In general, the natural log should be avoided just because $\mathrm{e}^2, \mathrm{e}^3, \dots$ are harder and less intuitive to calculate. In comparison, log base 2 and log base 10 are better choices, because $2^2, 2^3, 2^4, \dots$ or $10^2, 10^3, \dots$ are easy to compute in our heads.

In the dollars per day example, we used base 2 instead of base 10 because the resulting range is easier to interpret. The range of the values being plotted is 119.3340388, 1.7843103^{4}.

In base 10, this turns into a range that includes very few integers: just 0 and 1. With base two, our range includes -2, -1, 0, 1, 2, 3, 4, and 5. It is easier to compute $2^x$ and $10^x$ when $x$ is an integer and between -10 and 10, so we prefer to have smaller integers in the scale. Another consequence of a limited range is that choosing the binwidth is more challenging. With log base 2, we know that a binwidth of 1 will translate to a bin with range $x$ to $2x$.

For an example in which base 2 makes more sense, consider the daily gdp per capita which equals to gdp per capita divided by 365.

There are two ways to transform the data for visualization. - Transform the values before plotting them, i.e., aes(x=log10(gdp_per_capita)). I1n this case, the axis ticks display the transformed values such as 1, 2, 3, … - Use transformed scales in the axes, i.e., +scale_x_continuous(trans="log10"). In this case, the axis ticks display the original values such as 10, 100, 1000, … Both approaches are useful and have different strengths. Usually, the latter gives a better visualization of the data.

There are other transformations available through the trans argument.

The square root (sqrt) transformation is useful when considering counts.
The logistic transformation (logit) is useful when plotting proportions between 0 and 1.
The reverse transformation is useful when we want smaller values to be on the right or on top.

If we use the log transformation with base of 10, then we can easily show $100 (extremely poor), $500 (very poor), $1000 (poor), $5000 (middle), $10000 (well off), $50000 (rich), $100000 (very rich). This transformation provides a close-up of the mid to lower income countries. Here is the distribution if we apply a log base 10 transform:

g1=gapminder %>% 
  filter(year == past_year & !is.na(gdp)) %>%
  ggplot(aes(x=gdp_per_capita)) + 
  geom_histogram(binwidth = .18)+
  scale_x_continuous(trans="log10")
g2=gapminder %>% 
  filter(year == past_year & !is.na(gdp)) %>%
  ggplot(aes(gdp_per_capita)) + 
  geom_histogram(binwidth =0.7) +
  scale_x_continuous(trans="log2")
grid.arrange(g1,g2,ncol=2)

We see multiple bumps in the histogram. This is often referred to as multimodal distribution. The locations where the density reaches local maximum are called local modes.

The histogram above suggests that the 1970 country income distribution has two modes: one at about 1000 dollars and another at about 10000 dollars. This bimodality is consistent with a dichotomous world made up of countries with average incomes less than $5000 and countries above that. However, which countries consist of these two groups?

In addition to the histogram, we can visualize the same data using density plot.

6.5.1 Comparison among Regions

A histogram shows that the 1970 income has a multimodal distirbution which implies that the there are multiple clusters of countries with similar incomes and the clusters are separated from each other. However, we still do not know what are these clusters. Do these clusters represent the developed world and devleoping world? We will learn about clustering analysis in the later chapters. For now, we can manually divide the countries according to their geographical locations, such as continent and region.

gapminder %>% 
  filter(year == past_year & !is.na(gdp)) %>%
  ggplot(aes(gdp_per_capita, fill=continent)) + 
  geom_histogram(binwidth =.18) +
  scale_x_continuous(trans="log10") +
  scale_fill_brewer(palette = "Paired")

Let’s start by quickly examining the data by region. We reorder the regions by the median value and use a log scale.

gapminder %>% 
  filter(year == past_year & !is.na(gdp)) %>%
  ggplot(aes(gdp_per_capita, reorder(continent,gdp_per_capita), fill=continent)) +
  geom_boxplot(outlier.shape = NA) + 
  geom_jitter(height=0.1, shape = 4) +
  scale_x_continuous(trans = "log10") +
  annotation_logticks(sides = "b") +
  scale_fill_brewer(palette = "Paired") +
  theme(legend.position = "top")

gapminder %>% 
  filter(year == past_year & !is.na(gdp)) %>%
  mutate(region = reorder(region, gdp_per_capita, FUN = median)) %>%
  ggplot(aes(gdp_per_capita, region, 
             fill=continent, 
             label=country)) +
  geom_boxplot(outlier.shape = NA) + 
  geom_point(shape = 4) +
  geom_text_repel(color="blue", size=2)+
  scale_x_continuous(trans = "log10") +
  annotation_logticks(sides = "b") +
  scale_fill_brewer(palette = "Paired") +
  scale_color_brewer(palette = "Paired") +
  theme(legend.position = "top")

## Warning: The following aesthetics were dropped during statistical transformation: label
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

gapminder %>% 
  filter(year == past_year, 
         !is.na(gdp),
         region == "Western Asia") %>%
  select(country, region, gdp_per_capita)

##        country       region gdp_per_capita
## 1      Georgia Western Asia      1003.6948
## 2       Israel Western Asia     10058.2650
## 3         Oman Western Asia      4328.5134
## 4 Saudi Arabia Western Asia      7551.5971
## 5        Syria Western Asia       576.6997
## 6       Turkey Western Asia      2163.1682

We can already see that there is indeed a “west versus the rest” dichotomy: we see two clear groups, with the rich group composed of North America, Northern and Western Europe, New Zealand and Australia. We define groups based on this observation. We turn this group variable into a factor to control the order of the levels.

gapminder <- gapminder %>% 
  mutate(group = case_when(
    region %in% c("Northern Europe",
                  "Northern America",
                  "Australia and New Zealand",
                  "Western Europe",
                  "Southern Europe")  ~ "Developed",
    country %in% c("Japan", "Israel", "Hong Kong, China", "French Polynesia", "") ~ "Developed",
    region %in% c("Eastern Asia", 
                  "South-Eastern Asia",
                  "Southern Asia",
                  "Western Asia") ~ "Asia",
    region %in% c("Caribbean", 
                  "Central America", "South America") ~ "Latin America",
    continent == "Africa" ~ "Africa",
    TRUE ~ "Others"))
gapminder <- gapminder %>% 
  mutate(group = factor(group, levels = c("Others","Latin America", "Asia", "Africa", "Developed")))

The exploratory data analysis above has revealed two characteristics about average income distribution in 1970. Using a histogram, we found a bimodal distribution with the modes relating to poor and rich countries. We now want to compare the distribution across these five groups to confirm the “west versus the rest” dichotomy. The number of points in each category is large enough that a summary plot may be useful. We could generate five histograms or five density plots, but it may be more practical to have all the visual summaries in one plot. We therefore start by stacking boxplots next to each other. Note that we add the layer theme(axis.text.x = element_text(angle = 90, hjust = 1)) to turn the group labels vertical, since they do not fit if we show them horizontally, and remove the axis label to make space. Boxplots have the limitation that by summarizing the data into five numbers, we might miss important characteristics of the data. One way to avoid this is by showing the data.

g1=gapminder %>% 
  filter(year == past_year & !is.na(gdp)) %>%
  ggplot(aes(gdp_per_capita, reorder(group,gdp_per_capita), fill=group)) +
  geom_boxplot() +
  geom_jitter(height = 0.2, shape=4) +
  scale_x_continuous(trans = "log10") +
  scale_fill_brewer(palette = "Dark2")
g2=gapminder %>% 
  filter(year == past_year & !is.na(gdp)) %>%
  ggplot(aes(gdp_per_capita, fill=group)) + 
  geom_histogram(binwidth = .18) +
  scale_x_continuous(trans="log10") +
  scale_fill_brewer(palette = "Dark2")
grid.arrange(g1,g2,ncol=2)

The jitter scatter plot shows the location of each individual point. This does not always reveal important characteristics of the distribution. when the number of data points is so large that there is over-plotting, showing the data can be counterproductive.

The sistogram shows the interval each data point belongs to. This is less information than the scatter plot, but still sometimes can be overwhelming, and masks the true pattern in the data.

On the other hand, the boxplot further summarizes the data by providing a five-number summary, but this sometimes can be a over simplification of the data. For example, the boxplot does not permit us to discover bimodal distributions.

Apparenly, we need something in between, which is the density plot. We can show stacked smooth densities or histograms. We refer to these as ridge plots. Because we are used to visualizing densities with values in the x-axis, we stack them vertically. Also, because more space is needed in this approach, it is convenient to overlay them. A useful geom_density_ridges parameter is scale, which lets you determine the amount of overlap, with scale = 1 meaning no overlap and larger values resulting in more overlap.

Similar to the box plot above, we can add data to the ridge plot. The height of the points is jittered and should not be interpreted in any way. Alternatively, we could the rug representation of the data.

library(ggridges)
gapminder %>% 
  filter(year == past_year & !is.na(gdp_per_capita)) %>%
  ggplot(aes(gdp_per_capita, group, color=group)) + 
  geom_density_ridges(scale=1, jittered_points = TRUE) +
  geom_density_ridges(scale=1, jittered_points = TRUE, 
                      position = position_points_jitter(height = 0),
                      point_shape = '|', point_size = 3, 
                      point_alpha = 1, alpha = 0.7) +
  scale_x_continuous(trans = "log10") + 
  scale_fill_brewer(palette = "Dark2")

gapminder %>% 
  filter(year == past_year & !is.na(gdp_per_capita)) %>%
  ggplot(aes(gdp_per_capita, fill=group)) + 
  geom_density(alpha = 0.5, position="stack", bw = 0.2) +
  scale_x_continuous(trans = "log10", limit = c(30, 150000)) +
  scale_fill_brewer(palette = "Dark2")

6.5.2 Comparison between Developed and Developing Countries

Now versus then for income inequality

Visualization shows that there are two clusters in income. But does this dichotomy persist? Let’s use facet_grid see how the distributions have changed. To start, we will focus on two groups: the west and the rest. We make four histograms.

Before we interpret the findings of this plot, we notice that there are more countries represented in the 2010 histograms than in 1970: the total counts are larger. One reason for this is that several countries were founded after 1970. For example, the Soviet Union divided into several countries during the 1990s. Another reason is that data was available for more countries in 2010.

We remake the plots using only countries with data available for both years. In the data wrangling part of this book, we will learn tidyverse tools that permit us to write efficient code for this, but here we can use simple code using the intersect function:

past_year <- 1970
present_year <- 2010
years <- c(past_year, present_year)
country_list_1 <- gapminder %>% 
  filter(year == past_year & !is.na(gdp_per_capita)) %>% 
  pull(country)
country_list_2 <- gapminder %>% 
  filter(year == present_year & !is.na(gdp_per_capita)) %>% 
  pull(country)
country_list <- intersect(country_list_1, country_list_2)

These 108 account for 86% of the world population, so this subset should be representative.

Let’s remake the plot, but only for this subset by simply adding country %in% country_list to the filter function:

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(west = ifelse(group == "Developed", "Developed", "Developing")) %>%
  ggplot(aes(gdp_per_capita)) +
  geom_histogram(binwidth = 0.18, color = "black") +
  scale_x_continuous(trans = "log10") + 
  facet_grid(year ~ west)

We now see that the rich countries have become a bit richer, but percentage-wise, the poor countries appear to have improved more. In particular, we see that the proportion of developing countries earning more than $16 a day increased substantially.

To see which specific regions improved the most, we can remake the boxplots we made above, but now adding the year 2010 and then using facet to compare the two years.

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  ggplot(aes(group, gdp_per_capita)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(trans = "log10") +
  xlab("") +
  facet_grid(. ~ year)

Here, we pause to introduce another powerful ggplot2 feature. Because we want to compare each region before and after, it would be convenient to have the 1970 boxplot next to the 2010 boxplot for each region. In general, comparisons are easier when data are plotted next to each other.

So instead of faceting, we keep the data from each year together and ask to color (or fill) them depending on the year. Note that groups are automatically separated by year and each pair of boxplots drawn next to each other. Because year is a number, we turn it into a factor since ggplot2 automatically assigns a color to each category of a factor. Note that we have to convert the year columns from numeric to factor.

g1 = gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(year = factor(year)) %>%
  ggplot(aes(group, gdp_per_capita, fill = year)) +
  geom_boxplot() +
  scale_y_continuous(trans = "log10") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
g2 = gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(year = factor(year)) %>%
  ggplot(aes(continent, gdp_per_capita, fill = year)) +
  geom_boxplot() +
  scale_y_continuous(trans = "log10") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
g3 = gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(year = factor(year)) %>%
  ggplot(aes(region, gdp_per_capita, fill = year)) +
  geom_boxplot() +
  scale_y_continuous(trans = "log10") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
plots1=list(g1,g2,g3)
lay1 = rbind(c(1, 2), c(3, 3)) 
grid.arrange(grobs = plots1,
             layout_matrix = lay1)

Finally, we point out that if what we are most interested in is comparing before and after values, it might make more sense to plot the percentage increases. We are still not ready to learn to code this, but here is what the plot would look like:

g1 = gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(year = ifelse(year == past_year, "past", "present")) %>%
  select(country, group, year, gdp_per_capita) %>%
  spread(year, gdp_per_capita)  %>%
  mutate(percent_increase = (present-past)/past*100) %>%
  mutate(group = reorder(group, percent_increase, FUN = median)) %>%
  ggplot(aes(group, percent_increase, fill=group)) +
    geom_boxplot() + 
    geom_point(show.legend = FALSE, shape=4) +
    scale_y_continuous(trans = "log10") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    ylab(paste("Percent increase:", past_year, "to", present_year)) +
    scale_fill_brewer(palette = "Dark2")
g2 = gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(year = ifelse(year == past_year, "past", "present")) %>%
  select(country, region, year, gdp_per_capita, continent) %>%
  spread(year, gdp_per_capita)  %>%
  mutate(percent_increase = (present-past)/past*100) %>%
  mutate(continent = reorder(continent, 
                             percent_increase, FUN = median)) %>%
  ggplot(aes(continent, percent_increase, fill = continent)) +
    geom_boxplot() + 
    geom_jitter(width=0.1,shape=4, show.legend = FALSE) +
    scale_y_continuous(trans = "log10") +
    scale_fill_brewer(palette = "Paired") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab("") + 
    ylab(paste("Percent increase:", past_year, "to", present_year)) 
g3 = gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(year = ifelse(year == past_year, "past", "present")) %>%
  select(country, region, year, gdp_per_capita, continent) %>%
  spread(year, gdp_per_capita)  %>%
  mutate(percent_increase = (present-past)/past*100) %>%
  mutate(region = reorder(region, 
                          percent_increase, FUN = median)) %>%
  ggplot(aes(region, percent_increase, fill = continent)) +
    geom_boxplot() + 
    geom_jitter(width=0.1,shape=4, show.legend = FALSE) +
    scale_y_continuous(trans = "log10") +
    scale_fill_brewer(palette = "Paired") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab("") + 
    ylab(paste("Percent increase:", past_year, "to", present_year)) 
plots1=list(g1,g2,g3)
lay1 = rbind(c(1, 2), c(3, 3)) 
grid.arrange(grobs = plots1,
             layout_matrix = lay1)

The previous data exploration suggested that the income gap between rich and poor countries has narrowed considerably during the last 40 years. We used a series of histograms and boxplots to see this. We suggest a succinct way to convey this message with just one plot.

Summary

Let’s start by noting that density plots for income distribution in 1970 and 2010 deliver the message that the gap is closing:

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  ggplot(aes(gdp_per_capita)) +
    geom_density(fill = "grey") + 
    scale_x_continuous(trans = "log10", limit = c(30, 150000), 
                       breaks=c(100,1000,10000,100000),
                       labels=c("100","1000","10000","100000")) + 
    facet_grid(. ~ year)

In the 1970 plot, we see two clear modes: poor and rich countries. In 2010, it appears that some of the poor countries have shifted towards the right, closing the gap.

The next message we need to convey is that the reason for this change in distribution is that several poor countries became richer, rather than some rich countries becoming poorer. To do this, we can assign a color to the groups we identified during data exploration.

However, we first need to learn how to make these smooth densities in a way that preserves information on the number of countries in each group. To understand why we need this, note the discrepancy in the size of each group:

Developed	Developing
24	84

But when we overlay two densities, the default is to have the area represented by each distribution add up to 1, regardless of the size of each group. This makes it appear as if there are the same number of countries in each group. To change this, we will need to learn to access computed variables with geom_density function.

To have the areas of these densities be proportional to the size of the groups, we can simply multiply the y-axis values by the size of the group. From the geom_density help file, we see that the functions compute a variable called count that does exactly this. We want this variable to be on the y-axis rather than the density.

In ggplot2, we access these variables by surrounding the name with two dots. We will therefore use the following mapping. If we want the densities to be smoother, we use the bw argument so that the same bandwidth is used in each density. We selected 0.2 after trying out several values.

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(group = ifelse(group == "Developed", "Developed", "Developing")) %>%
  ggplot(aes(x = gdp_per_capita, y = ..count.., fill = group)) +
    scale_x_continuous(trans = "log10", limit = c(30,100000)) + 
    geom_density(alpha = 0.2, bw = 0.2) + 
    facet_grid(year ~ .)

## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

This plot now shows what is happening very clearly. The developing world distribution is changing. A third mode appears consisting of the countries that most narrowed the gap.

To visualize if any of the groups defined above are driving this we can quickly make a ridge plot:

gapminder %>% 
  filter(year %in% years & !is.na(gdp_per_capita)) %>%
  ggplot(aes(gdp_per_capita, group)) + 
    geom_density_ridges(adjust = 1.5) +
    scale_x_continuous(trans = "log10", limit = c(30,100000)) + 
    facet_grid(. ~ year)

Another way to achieve this is by stacking the densities on top of each other:

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  group_by(year) %>%
  mutate(weight = population/sum(population)) %>%
  ungroup() %>%
  ggplot(aes(gdp_per_capita, fill = group)) +
    geom_density(alpha = 0.2, bw = 0.2, position = "stack") + 
    scale_x_continuous(trans = "log10", limit = c(30, 150000)) + 
    facet_grid(year ~ .)

Here we can clearly see how the distributions for East Asia, Latin America, and others shift markedly to the right. While Sub-Saharan Africa remains stagnant.

Notice that we order the levels of the group so that the West’s density is plotted first, then Sub-Saharan Africa. Having the two extremes plotted first allows us to see the remaining bimodality better.

Weighted densities

As a final point, we note that these distributions weigh every country the same. So if most of the population is improving, but living in a very large country, such as China, we might not appreciate this. We can actually weight the smooth densities using the weight mapping argument. The plot then looks like this:

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  group_by(year) %>%
  mutate(weight = population/sum(population)) %>%
  ungroup() %>%
  ggplot(aes(gdp_per_capita, fill = group, weight = weight)) +
    geom_density(alpha = 0.2, bw = 0.2, position = "stack") + 
    scale_x_continuous(trans = "log10", limit = c(30, 150000)) + 
    facet_grid(year ~ .)

This particular figure shows very clearly how the income distribution gap is closing with most of the poor remaining in Sub-Saharan Africa.

References

Irizarry, Rafael A. 2019. Introduction to Data Science: Data Analysis and Prediction Algorithms with r. 1st ed. Chapman; Hall/CRC. https://rafalab.github.io/dsbook/.

Ch5 Visualization Case Study

Descriptive Analytics and Data Visualization

Yichen Qin (qinyn@ucmail.uc.edu), University of Cincinnati

2024-12-20

Ch5 Visualization Case Study

1 Case Study 1: Simpson’s Paradox in Baseball

2 Case Study 2: London’s Wind Speed and Direction

3 Case Study 3: Basketball Shots

4 Case Study 4:

5 Case Study 5: Temperature by Cities

6 Case Study: Gapminder Data Set

6.1 Motivation

6.2 Data Import

6.3 Missing Data

6.4 Divided World

6.5 Income Inequality

6.5.1 Comparison among Regions

6.5.2 Comparison between Developed and Developing Countries

References